Indexing Wiki
Introduction
We will index the wiki so that searches run faster. Searching an index is, if implemented correctly, much faster than a brute-force search through every file in the wiki. To do this we have to create an index, or possibly several indexes. Several indexing methods are described below. There is also a section, Preferred Implementation, where anyone can (and should) post their opinion about which method we should use.
Lucene
Lucene is the indexing mechanism that SAQ will use for the GW.
Lucene is a Java library that offers two main services: text indexing and text searching. Both are important for the GW. Users will edit files on the GW, and those files then have to be indexed (or reindexed), and the whole point of using Lucene is to speed up searches. Several benchmarks for Lucene are available on the main site. Lucene is well known, so many extra modules exist for it.
Lucene creates an index from "documents". These are not ordinary documents such as a text file or a Word document; a document is a class. It contains several fields, one of which holds the content of the file. Any number of fields can be added to a document, which is useful for extra data we might want to attach to it (see the indexing methods).
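To make the idea of a document as a bag of named fields concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene's own classes: WikiDocument, forFile and the field names "path" and "contents" are assumptions made up for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an index document: a bag of named fields.
// This is NOT Lucene's Document class, only an illustration of its shape.
class WikiDocument {
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Adds (or overwrites) a named field; returns this so calls can be chained.
    WikiDocument add(String name, String value) {
        fields.put(name, value);
        return this;
    }

    // Returns the field value, or null if the field was never added.
    String get(String name) {
        return fields.get(name);
    }

    // Builds the document for one wiki file: its path plus its content.
    static WikiDocument forFile(String path, String content) {
        return new WikiDocument()
                .add("path", path)
                .add("contents", content);
    }
}
```

Extra data is then just one more field, for example `WikiDocument.forFile("Main/WebHome.txt", "hello wiki").add("subwiki", "Main")`.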
Indexing
General
Below are three different indexing methods. Some information is always needed, though. Whatever method we use, we always want to store a pathname (or some similar unique key) by which we can identify the file we just hit.
We also need to make a design decision: how do we handle authorization? Only the method that indexes by usergroup avoids this question; for the other two methods we have to decide how to check authorization. There are two possible ways to do this:
- Username/group field
- Ask storage/secure storage
If we have a field, we can easily filter the allowed pages by automatically doing a constrained search. But this also means that we need to update the index every time someone changes access to a page and whenever someone is added to or removed from a usergroup. On the other hand, if we have to ask the storage for every result whether the user is allowed to view it, our search might be very slow. For now we will assume the first method: a username/usergroup field.
This means that, whatever method we use, our documents will contain at least the following fields:
Fields:
- Path
- Username / Usergroup
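The constrained search on a usergroup field could look roughly like the sketch below. It is plain Java with a linear scan standing in for a real index lookup; IndexedPage, ConstrainedSearch and the assumption that each page carries exactly one usergroup are illustrative choices, not part of the design above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical index entry carrying the two fields every document needs.
class IndexedPage {
    final String path;       // unique key identifying the file that was hit
    final String usergroup;  // group allowed to view the page (assumed: one per page)
    final String contents;

    IndexedPage(String path, String usergroup, String contents) {
        this.path = path;
        this.usergroup = usergroup;
        this.contents = contents;
    }
}

class ConstrainedSearch {
    // Returns the paths of pages that contain the term AND belong to one of
    // the searching user's groups. A real index would look the term up
    // instead of scanning every page.
    static List<String> search(List<IndexedPage> index, String term, Set<String> userGroups) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (IndexedPage page : index) {
            boolean matches = page.contents.toLowerCase(Locale.ROOT).contains(needle);
            boolean allowed = userGroups.contains(page.usergroup);
            if (matches && allowed) {
                hits.add(page.path);
            }
        }
        return hits;
    }
}
```

The filter is applied automatically on every query, so the user never sees a hit he may not open. The cost is the one named above: the usergroup value stored with each page has to be kept up to date.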
Indexing the Whole Wiki
Indexing the whole Wiki is just what it says: we generate one big index that covers everything in the wiki. We then need several fields to support useful constrained searches. For instance, we want to be able to do a multi-web or a single-web search. For this we need to be able to tell the webs apart, and thus we need a subwiki field.
Fields:
- Subwiki
Indexing Subwikis
With this method we do not generate a single index for the whole wiki; instead we generate one index per subwiki. We then have several index files and can simply select the index of the subwiki we want to search. When we need to search the whole wiki, however, we have to look in every subwiki index.
Fields:
- No extra fields are involved
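A rough sketch of the per-subwiki layout is given below, again in plain Java rather than Lucene. SubwikiIndexes, Hit and the numeric scores are assumptions for illustration. A single-web search touches one index, while a multi-web search has to visit every index and merge the hits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical search hit: the path of the file and a relevance score.
class Hit {
    final String path;
    final double score;

    Hit(String path, double score) {
        this.path = path;
        this.score = score;
    }
}

class SubwikiIndexes {
    // subwiki name -> (term -> hits); each inner map stands in for one index file.
    private final Map<String, Map<String, List<Hit>>> indexes = new HashMap<>();

    void add(String subwiki, String term, String path, double score) {
        indexes.computeIfAbsent(subwiki, k -> new HashMap<>())
               .computeIfAbsent(term, k -> new ArrayList<>())
               .add(new Hit(path, score));
    }

    // Single-web search: only the index of the chosen subwiki is consulted.
    List<Hit> searchOne(String subwiki, String term) {
        List<Hit> result = new ArrayList<>();
        Map<String, List<Hit>> index = indexes.get(subwiki);
        if (index != null && index.containsKey(term)) {
            result.addAll(index.get(term));
        }
        return result;
    }

    // Multi-web search: visit every subwiki index, then merge by score.
    // Scores computed by separate indexes need not be comparable, so this
    // ordering is only an approximation of "best results on top".
    List<Hit> searchAll(String term) {
        List<Hit> merged = new ArrayList<>();
        for (String subwiki : indexes.keySet()) {
            merged.addAll(searchOne(subwiki, term));
        }
        merged.sort((a, b) -> Double.compare(b.score, a.score));
        return merged;
    }
}
```

Note that the subwiki is never stored as a field: it is implied by which index a hit came from.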
Indexing Usergroups
The other methods put every file into the index, even files a given user is not authorised to view. A neater way would be to not index those files for this user at all, so we generate one index per usergroup. We would then not need to store any usergroup information in the index. We would still need some user information, though, because a single user may be denied access to a file that the rest of his usergroup can still view. We cannot create an index for every username: that would mean far too many indexes.
Fields:
- Username (for per-user exceptions)
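The per-usergroup layout might be sketched as follows, in plain Java. UsergroupIndexes and its methods are hypothetical, and keeping the per-user exceptions in a separate deny list is an assumption made here for brevity. A user who is in several groups has every one of his group indexes searched, and the results are merged without duplicates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UsergroupIndexes {
    // usergroup -> (term -> paths visible to that group); one inner map per index.
    private final Map<String, Map<String, Set<String>>> indexes = new HashMap<>();
    // username -> paths denied to that user despite his group membership (assumed storage).
    private final Map<String, Set<String>> deniedPerUser = new HashMap<>();

    void index(String usergroup, String term, String path) {
        indexes.computeIfAbsent(usergroup, k -> new HashMap<>())
               .computeIfAbsent(term, k -> new LinkedHashSet<>())
               .add(path);
    }

    void deny(String username, String path) {
        deniedPerUser.computeIfAbsent(username, k -> new LinkedHashSet<>()).add(path);
    }

    // Searches the index of every group the user belongs to, removes
    // duplicates, then drops the paths this particular user may not view.
    List<String> search(String username, List<String> groups, String term) {
        Set<String> paths = new LinkedHashSet<>();
        for (String group : groups) {
            Map<String, Set<String>> index = indexes.get(group);
            if (index != null && index.containsKey(term)) {
                paths.addAll(index.get(term));
            }
        }
        Set<String> denied = deniedPerUser.get(username);
        if (denied != null) {
            paths.removeAll(denied);
        }
        return new ArrayList<>(paths);
    }
}
```

The sketch also shows why indexing is slow with this method: a page visible to several groups has to be added to each group's index.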
Preferred Implementation
| Whole Wiki: Pros | Whole Wiki: Cons | Subwikis: Pros | Subwikis: Cons | Usergroups: Pros | Usergroups: Cons |
| Fast multi-web search | Indexes files the searching user cannot see | Fast single-web search | Slow multi-web search | Might be neater | Slow indexing |
| Fast updating | Might become a huge file | Several small files | Cannot guarantee best results on top | | Many indexes, which we might have to look through (a user can be in several groups) |
| Still supports authorization | | Information stored implicitly | Depends on notion of subwikis | | |
So currently we will be using:
Indexing the whole wiki
-- RaymonVanWanrooij - 28 Sep 2004