Search And Query
This project is concerned with searching and querying a GW wiki server. A search mechanism serves to easily find and retrieve information ("content") from a server. Examples are finding pages containing information about (or related to) particular topics, and searching for keywords or text fragments. Querying is concerned with inspecting structural information of a wiki server. Examples are: finding "dead" pages (i.e., pages to which no references exist), graphically displaying the structure of pages and references, and displaying the most frequently visited pages.
A list of the requirements for Search And Query can be found at SearchAndQueryRequirements.
- User's point of view:
- Interfacing the user (frontend) Users should be able to express simple and complex searches and perform different queries. The user interface therefore forms an important part of the query/search mechanism. The user interface is handled by the rendering servlet.
- query/search expression language To let the user perform searches/queries, a language is needed to express queries and searches. This language should be able to handle more complex searches than just single word searches.
- result processing language The result of a query/search should be presented to the user. This means that the result should be processed and transformed into a wiki page. Processing might involve, for example, iterating over a result list to add text, filtering the list, or even joining the lists of different queries/searches.
- Server point of view:
- interfacing the content (backend) The data that is searched or queried needs to be accessed. This project therefore depends on the data representation of the GW wiki. Since GW wiki will use Subversion for data storage (which is handled by the Storage management group), we cannot simply access regular files but have to interface with Subversion somehow.
- indexing Querying and searching should be FAST. Therefore we cannot just perform a search on plain files. Instead, an indexing mechanism is needed that is able to quickly find the requested information. A complicating factor is that a GW wiki site evolves (i.e., the content is not static). Hence, an index can become outdated and needs to be regenerated.
- cached searches/queries Even with the use of indexes, searching/querying will be costly. The GW wiki search/query mechanism should therefore be able to cache results, such that subsequent queries no longer have to be executed (unless the cached results are outdated).
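The indexing and caching points above can be illustrated with a minimal sketch. It is not the project's implementation (which relies on Lucene); the class `WikiIndex` and its methods are hypothetical names chosen for illustration. It assumes a simple in-memory inverted index, re-indexes a single page on change, and treats a cached result as outdated as soon as any page has changed.

```python
import re
from collections import defaultdict


class WikiIndex:
    """Toy inverted index: maps each word to the set of pages containing it."""

    def __init__(self):
        self.pages = {}                    # page name -> current text
        self.postings = defaultdict(set)   # word -> set of page names
        self.version = 0                   # bumped on every content change
        self._cache = {}                   # query -> (index version, result)

    @staticmethod
    def _words(text):
        return set(re.findall(r"\w+", text.lower()))

    def update_page(self, name, text):
        """(Re)index one page instead of regenerating the whole index."""
        for word in self._words(self.pages.get(name, "")):
            self.postings[word].discard(name)
        self.pages[name] = text
        for word in self._words(text):
            self.postings[word].add(name)
        self.version += 1                  # all cached results are now outdated

    def search(self, query):
        """Return the sorted names of pages containing every word of the query."""
        cached = self._cache.get(query)
        if cached is not None and cached[0] == self.version:
            return cached[1]               # cache hit that is still up to date
        sets = [self.postings.get(w, set()) for w in self._words(query)]
        result = sorted(set.intersection(*sets)) if sets else []
        self._cache[query] = (self.version, result)
        return result


index = WikiIndex()
index.update_page("HomePage", "Welcome to the GW wiki")
index.update_page("SearchAndQuery", "Searching and querying a GW wiki server")
print(index.search("gw wiki"))   # ['HomePage', 'SearchAndQuery']
index.update_page("HomePage", "Welcome")
print(index.search("gw wiki"))   # ['SearchAndQuery']
```

Invalidating every cached result on any change is the simplest safe choice; a finer-grained scheme would only drop results that involve the changed page.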
- PluggableFactory - Study on possibilities of plugins
- IndexingWiki - Study on what should be indexed
- SqArchitecture? - Design of the architecture used within Search And Query
- search/query expression language as provided by Lucene
- presentation of search results, which will be done by generating GWiki Language that is then passed on to the general renderer.
- Indexed Search (using Lucene) See also http://www.cs.uu.nl/groups/ST/Gw/TechnicalDocumentation
- Possible Improvements:
- Caching of results. This may conflict with the access control mechanism.
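One possible way to reconcile result caching with access control is sketched below. It is only an illustration under stated assumptions: `AccessAwareCache`, `run_search` and `can_read` are hypothetical names, not part of GW or Lucene. The idea is to cache the unfiltered hit list per query and to apply the permission check on every request, so a cached result is never shown to a user who may not read it.

```python
class AccessAwareCache:
    """Caches raw search hits and filters them per user on every request."""

    def __init__(self, run_search, can_read):
        self._run_search = run_search   # hypothetical: query -> list of page names
        self._can_read = can_read       # hypothetical: (user, page) -> bool
        self._hits = {}                 # query -> unfiltered list of page names

    def search(self, user, query):
        if query not in self._hits:
            # Only the expensive search is cached, never the permission decision.
            self._hits[query] = list(self._run_search(query))
        return [page for page in self._hits[query] if self._can_read(user, page)]


def demo_search(query):
    # Stand-in for the real (indexed) search.
    return ["PublicNotes", "StaffOnly"]


def demo_can_read(user, page):
    # Stand-in for the real access control mechanism.
    return page != "StaffOnly" or user == "staff"


cache = AccessAwareCache(demo_search, demo_can_read)
print(cache.search("guest", "notes"))   # ['PublicNotes']
print(cache.search("staff", "notes"))   # ['PublicNotes', 'StaffOnly']
```

The trade-off is that the permission check runs on every request, which is acceptable only if it is much cheaper than the search itself.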
- Minimal requirements:
- topic search Search the GW web for a specific topic, for pages containing links to that topic, or for the pages referenced from the topic.
- full text search Find pages containing particular text phrases.
- constrained search This form of searching is restricted to pages meeting specific criteria. For instance, only search pages of a particular category, all pages but my own, all pages not changed since last week.
- If GWiki Language will support:
- keyword search Search the GW web for pages containing ordinary words (in/excluding wiki topics).
- category search Categories are a way to group the information of a GW web. For instance, pages about the GW web may be categorized as "Generalized Wiki". Category search finds all pages of a particular category.
- search in search If the result of a query/search is too large, search within that result to narrow it down.
- multi-web search This enables a search through the complete GW wiki directory structure instead of one or more directory subtrees.
- querying web structure e.g., # topics, # of directories, # users, ...
- dead pages Find pages to which no references exist
- live pages Find pages to which references exist
- new pages Find pages that are new since a particular date
- most recently changed pages Produce a list of topics that have most recently been changed. By default you can produce a list of the top-ten most recently changed topics. On request, a list of all topics, all topics of a particular category, etc. is produced.
- most visited pages Produce a list of topics that have recently been visited.
- page readers/visitors Produce a list of users who have visited a particular topic (ever, or since a particular moment).
- page writers Produce a list of users who have contributed to a particular topic (ever, or since a particular moment).
- changes Produce a list of recent changes of a topic. Produce a list of topics within a category/subweb that have been changed.
- page and reference structure Display the graph structure of the GW web and all links between topics
- web site activity browser Visualize the activity of a GW web. This gives insight into which parts are actively being modified/accessed and which parts are (almost) dead.
- web counters/statistics Implement web counters that tell how often pages are visited/edited etc. Use the different queries to provide statistic pages per topic.
- Redundancy Wiki web sites may easily contain redundant information. The reason is that, since anyone can contribute, it is not always obvious whether information is already available. An analysis helping to detect redundancy might be extremely helpful in maintaining large wiki web sites.
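The structural queries listed above ("dead pages" and "live pages") can be sketched on a plain link graph. This is an illustration only: the function names and the `links` mapping (page name to the pages it links to) are assumptions, and the choice to ignore a page's links to itself is ours, not something the requirements state.

```python
def referenced_pages(links):
    """Return the set of pages that some *other* page links to.

    links: dict mapping a page name to an iterable of page names it links to.
    """
    targets = set()
    for source, outgoing in links.items():
        # Assumption: a self-link does not count as a reference.
        targets.update(t for t in outgoing if t != source)
    return targets


def dead_pages(links):
    """Pages to which no references exist."""
    return sorted(set(links) - referenced_pages(links))


def live_pages(links):
    """Pages to which at least one reference exists."""
    return sorted(set(links) & referenced_pages(links))


links = {
    "HomePage": ["SearchAndQuery", "IndexingWiki"],
    "SearchAndQuery": ["HomePage"],
    "IndexingWiki": [],
    "OldNotes": ["HomePage"],
}
print(dead_pages(links))   # ['OldNotes']
print(live_pages(links))   # ['HomePage', 'IndexingWiki', 'SearchAndQuery']
```

The same mapping could also feed the "page and reference structure" display, since it already describes the graph of topics and links.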
Transcripts of our meetings: SearchAndQueryLogs
- 13 Sep 2004
- 23 Sep 2004