Storage Management Requirements
GW
Advisor/Customer:
Eelco Dolstra
Most current Wikis are based on RCS or CVS. This has some
limitations, the main one being that files (Wiki pages), not
directories, are versioned. This means that:
- It is not possible to rename or move files, refactor the directory structure, etc.
- Wikis generally do not support a recursive directory structure.
- Commits are per-file, so a change that spans several files is a series of independent commits rather than one atomic commit. If you make related changes to a set of pages, the Wiki may be in an inconsistent state while the changes are in progress, and it is hard to undo (back out) the changes as a unit.
Since Subversion solves all these problems, we will use it as the
storage back-end of GW. An entire Wiki will be stored in a single
Subversion repository.
The goal of the storage team is to implement the Subversion storage
back-end. I envision the following milestone releases (subject to
change):
Release 0 - initial GW
On startup GW does a checkout of the repository. Edits happen on the
working copy. However, there are no commits, so all changes to the
Wiki are lost when the server is restarted. There is no locking of
any kind.
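
A minimal sketch of the Release 0 behaviour, assuming the server shells out to the `svn` command-line client (GW might equally use language bindings). The names `checkout_command`, `start_wiki` and `save_page` are hypothetical, as is the injectable `run` parameter, which exists only so the sketch can be exercised without a repository.

```python
import subprocess
from pathlib import Path
from typing import List


def checkout_command(repo_url: str, workdir: Path) -> List[str]:
    """Build the `svn checkout` command that is run once at startup."""
    return ["svn", "checkout", "--non-interactive", repo_url, str(workdir)]


def start_wiki(repo_url: str, workdir: Path, run=subprocess.run) -> None:
    """Check out the repository into the single, global working copy."""
    run(checkout_command(repo_url, workdir), check=True)


def save_page(workdir: Path, page: str, text: str) -> Path:
    """Release 0 save: overwrite the file in the working copy.

    Nothing is committed, so the edit is lost on the next checkout.
    """
    target = workdir / page
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

Because there is no locking, two concurrent calls to `save_page` for the same page simply overwrite each other.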
Release 1 - persistent storage
Save operations should cause a commit to happen. When a user edits a
new page, the page should of course be added to the repository first.
There is still no locking, though.
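
The save-then-commit sequence could look roughly like this, again assuming the `svn` command-line client. The function names and the `run` parameter are hypothetical. Treating "file does not exist yet" as "page is new" is a simplification that only holds if every file in the working copy got there through a checkout or through this function.

```python
import subprocess
from pathlib import Path
from typing import List


def save_commands(page: str, is_new: bool, message: str) -> List[List[str]]:
    """Return the svn commands needed to persist one saved page."""
    commands = []
    if is_new:
        # A page unknown to Subversion must be scheduled for addition
        # first; --parents also adds any new intermediate directories.
        commands.append(["svn", "add", "--parents", page])
    # Commit the whole working copy so that newly added parent
    # directories end up in the same commit as the page.
    commands.append(["svn", "commit", "--non-interactive", "-m", message])
    return commands


def save_and_commit(workdir: Path, page: str, text: str, message: str,
                    run=subprocess.run) -> None:
    """Write the page into the working copy, then add/commit it."""
    target = workdir / page
    is_new = not target.exists()  # simplification, see the note above
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    for command in save_commands(page, is_new, message):
        run(command, cwd=workdir, check=True)
```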
R2 - multi-file edits ("transactions")
When a user starts editing a file, create a new working copy.
Edits happen on this per-session working copy and are not visible to
other users. So a "Save" causes the per-session working copy to be
modified, but nothing is committed.
There should now be a "Commit" operation that causes the entire
per-session working copy to be committed. After this the per-session
working copy can be deleted, and the global working copy should be
updated. This makes the changes globally visible.
Merge conflicts are ignored for now.
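
The session life cycle described above can be sketched as follows. `SessionStore` and its methods are hypothetical names, the clone is a plain directory copy, and the `svn` command-line client is assumed for commit and update. As in R2 itself, a failing commit (merge conflict) is not handled.

```python
import shutil
import subprocess
import uuid
from pathlib import Path
from typing import Optional


class SessionStore:
    """Per-session working copies cloned from one global working copy."""

    def __init__(self, global_wc: Path, sessions_root: Path,
                 run=subprocess.run):
        self.global_wc = global_wc
        self.sessions_root = sessions_root
        self.run = run

    def begin(self) -> str:
        """Start an edit session by cloning the global working copy."""
        sid = uuid.uuid4().hex
        shutil.copytree(self.global_wc, self.sessions_root / sid)
        return sid

    def save(self, sid: str, page: str, text: str) -> None:
        """'Save' only touches the session copy; nothing is committed."""
        target = self.sessions_root / sid / page
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    def read(self, page: str, sid: Optional[str] = None) -> str:
        """Read from the session copy if given, else the global copy."""
        base = self.sessions_root / sid if sid else self.global_wc
        return (base / page).read_text(encoding="utf-8")

    def commit(self, sid: str, message: str) -> None:
        """Commit the session copy, drop it, and refresh the global copy."""
        wc = self.sessions_root / sid
        self.run(["svn", "commit", "--non-interactive", "-m", message],
                 cwd=wc, check=True)
        shutil.rmtree(wc)
        self.run(["svn", "update", "--non-interactive"],
                 cwd=self.global_wc, check=True)
```

Other users keep reading from the global working copy, so they only see the session's changes after `commit` has run the final update.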
R3 - merge conflicts
If on commit a merge conflict occurs, the per-session working copy
should be retained, the user should be presented with a list of
conflicting pages (showing the conflicts in those pages) and be
allowed to edit those pages. Most of the work here should be done by
the versioning UI team, but the storage team has to provide the
supporting infrastructure.
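
One possible piece of that supporting infrastructure is a function that lists the conflicting pages in a per-session working copy. The sketch below assumes the storage layer obtains the XML produced by `svn status --xml` and looks for entries whose `wc-status` element has `item="conflicted"`; the function name is hypothetical, and tree conflicts are not covered.

```python
import xml.etree.ElementTree as ET
from typing import List


def conflicted_pages(status_xml: str) -> List[str]:
    """Return the paths that `svn status --xml` reports as conflicted."""
    root = ET.fromstring(status_xml)
    pages = []
    for entry in root.iter("entry"):
        wc_status = entry.find("wc-status")
        if wc_status is not None and wc_status.get("item") == "conflicted":
            pages.append(entry.get("path"))
    return pages


# Hand-written sample input, trimmed to the attributes used above.
SAMPLE = """
<status>
  <target path=".">
    <entry path="FrontPage.txt">
      <wc-status item="conflicted" props="none"/>
    </entry>
    <entry path="Sandbox.txt">
      <wc-status item="modified" props="none"/>
    </entry>
  </target>
</status>
"""

print(conflicted_pages(SAMPLE))
```

The versioning UI could present this list to the user and show the conflict markers that Subversion leaves in each listed file.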
R4 - use RA layer
Using working copies is inefficient. For instance, to start an edit
session, we have to clone the entire working copy. This doesn't scale
well. So instead of using Subversion's working copy (WC) layer, we
should use the remote access (RA) layer, which allows us to fetch and
edit just those files that are involved in an operation.
It's possible that the high-level Subversion bindings only support WC
operations, not RA operations. So additional C/Java bindings might
have to be created.
R5 - caching for RA operations
Using the RA layer is scalable, but it's also slow. For instance, to
view a page, we have to fetch it from the Subversion server every time
(while before R4, we could just get it from the working copy). So RA
fetches should be cached.
Of course, the cache should be properly invalidated on edit
operations.
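
A minimal sketch of such a cache, assuming page contents are keyed by repository path and that the actual RA fetch is supplied as a function. `PageCache` and the `fetch` parameter are hypothetical names, and eviction and concurrent access are left out.

```python
from typing import Callable, Dict, Iterable


class PageCache:
    """Cache page contents fetched through a (slow) remote fetch function."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch
        self._pages: Dict[str, str] = {}

    def get(self, path: str) -> str:
        """Return the cached page, fetching it only on a cache miss."""
        if path not in self._pages:
            self._pages[path] = self._fetch(path)
        return self._pages[path]

    def invalidate(self, paths: Iterable[str]) -> None:
        """Drop entries for pages changed by an edit operation."""
        for path in paths:
            self._pages.pop(path, None)
```

After a commit, the storage layer would call `invalidate` with every path the commit touched, so that the next view fetches the new revision.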
Additional complications
The storage layer is quite fundamental. The entire Wiki depends on
it. The storage team should design a simple but sufficient interface
to the storage layer that other teams can develop against. In
particular, the work of the versioning UI team is closely related to that
of the storage team. For instance, for merge support it is necessary
that the storage layer offers to the upper layers notification that
there is a merge conflict, a way to query what the conflicts are, and
a way to clear the conflict situation. Close communication and
frequent syncing with the versioning UI team are probably required.
Big bang integration is not an option.
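
To make the discussion concrete, the interface might look something like the sketch below. All names (`Storage`, `MergeConflict`, the method names and the session identifier) are hypothetical; the point is only that conflict notification, conflict query and conflict clearing each have an explicit entry point that other teams can code against early.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class MergeConflict(Exception):
    """Notification to the upper layers that a commit hit conflicts."""

    def __init__(self, pages: Iterable[str]):
        self.pages = list(pages)
        super().__init__("merge conflict in: " + ", ".join(self.pages))


class Storage(ABC):
    """Interface that the other teams develop against."""

    @abstractmethod
    def read(self, session: str, page: str) -> str:
        """Return the page text as seen by this session."""

    @abstractmethod
    def save(self, session: str, page: str, text: str) -> None:
        """Store the page in the session without committing."""

    @abstractmethod
    def commit(self, session: str, message: str) -> None:
        """Commit the session; raise MergeConflict on conflicts."""

    @abstractmethod
    def conflicts(self, session: str) -> List[str]:
        """Query which pages are currently in conflict."""

    @abstractmethod
    def resolve(self, session: str, page: str) -> None:
        """Clear the conflict state of a page after the user fixed it."""
```

A trivial in-memory implementation of `Storage` would let the UI teams start work before the Subversion back-end exists, which is one way to avoid big bang integration.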
Maybe R4/R5 aren't such a good idea since replicating a lot of the
functionality in the WC layer (such as support for moving files) is a
lot of work. However, something should be done about the scalability
problem in R3. Alternatives might be to clone working copies using
hard or symbolic links, to clone the working copy lazily, and so on.
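
As an illustration of the hard-link alternative, here is a sketch that clones a directory tree by linking files instead of copying them. The function names are hypothetical. Hard-linked files share their contents, so every write must replace the file rather than modify it in place; whether the Subversion client itself always does so is not verified here and would have to be checked before relying on this approach.

```python
import os
from pathlib import Path


def clone_with_hard_links(src: Path, dst: Path) -> None:
    """Recreate the directory tree of src under dst, hard-linking files."""
    for root, _dirs, files in os.walk(src):
        rel = Path(root).relative_to(src)
        (dst / rel).mkdir(parents=True, exist_ok=True)
        for name in files:
            os.link(Path(root) / name, dst / rel / name)


def save_breaking_link(path: Path, text: str) -> None:
    """Write via a temporary file and rename it over the target.

    The rename gives the clone its own file, so the original that
    shares the hard link keeps its old contents.
    """
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    os.replace(tmp, path)
```

Cloning then costs one directory entry per file rather than a full copy of the data, and only the pages that a session actually edits take up extra space.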