You are here:
(27 Nov 2007,
---+ A Simple Spell Checker ---++ Introduction For the last AFP exercise, you will have to implement a simple spell checker. The approach we follow is straightforward. We use one file containing a sorted list of single words as our dictionary. Next, we we determine all words that occur in the document that we want to spell check, and we report all words from this document that are not part of the dictionary. One criterion on which your implementation will be judged is how well it handles large input files (both the dictionary and the document to spell check). It is your task to monitor the run-time behavior of your application (time and space behavior) and to optimize your code. For this, you will have to use the [[http://www.haskell.org/ghc/docs/latest/html/users_guide/profiling.html][profiling tools]] that GHC offers you. ---++ Resources To be more precise about what is expected from your implementation, please have a look at the [[http://www.gutenberg.org/wiki/Main_Page][Project Gutenberg Wiki]], a collection of free ebooks. A word list (the dictionary) and a book have been selected from this collection and these two documents will be used to judge your implementation. * For the word list (dictionary) we use the [[http://www.gutenberg.org/etext/3201][Moby Word Lists]] by Grady Ward, and in particular, the file named =SINGLE.TXT=. This file contains over 350,000 English words. On some systems, an alternative word list can be found at =/usr/share/dict/words=. * The document that we will spell check is [[http://www.gutenberg.org/dirs/etext94/arabn11.txt][The Arabian Nights Entertainments]] by Andrew Lang. This story consists of almost 12,000 lines. You are of course free to test your implementation using other text documents. ---++ Requirements The module (or modules) that you submit should have a main function (of type =IO ()=) such that it can be compiled to an executable. It should be possible to specify the word list and the document to be checked by supplying parameters. You can use the function =getArgs= from =System.Environment= to obtain the command-line arguments, or you could import =System.Console.GetOpt= (the fancy way to process command-line options). This task description is deliberately imprecise about how the document should be checked. Some suggestions follow: * Do not take capital letters into account: instead, convert these letters to lower-case letters. * The text will typically contain punctuation marks, quotations, etc. These symbols must be dealt with care. * Ignore numbers that appear in the text. * Report an unknown word only once by writing a string to the standard output device. Also report the line number of its first occurrence. Subsequent occurrences of the same word have to be ignored. * If you apply different rules, please put a comment in your source. There are several alternative ways to represent a dictionary: of course, the representation you choose will have a major impact on the time and space behavior of your implementation. It is allowed to reuse existing data structures, but you can also devise your own special purpose data type. Five operations that must be supported (in both scenarios) are: * creating an empty dictionary, * inserting one element in a dictionary, * removing one element from a dictionary, * test whether a word is in the dictionary or not, and * determining the size of the dictionary. Please write [[http://www.haskell.org/ghc/docs/latest/html/libraries/QuickCheck/Test-QuickCheck.html][QuickCheck]] properties that describe as precisely as possible how the five operations interoperate. Write one top-level function that checks all your properties, and monitor the random test data to validate the results. ---++ Submission Your submission for this exercise should contain the following parts: * Your Haskell module(s), including the function =main= and the [[http://www.haskell.org/ghc/docs/latest/html/libraries/QuickCheck/Test-QuickCheck.html][QuickCheck]] properties. * A brief instruction how to compile your program (which flags, etcetera). * A _Time and Allocation Profiling Report_ (a .prof file) that clearly shows which parts of your program are most expensive. It should be clear which input files you used. * A graph that shows the _Memory usage_ of your program (a .ps file). It should be clear which input files you used. * Do *not* send me any of the word lists nor the ebooks. ---++ Bonus What follows are optional features to excel in this exercise. * After having reported all unknown words, present a summary that mentions the size of the dictionary and the text document, and that reports the number of unknown words found in the text. * Implement the dictionary using the *trie* data structure. A trie is a data structure for storing strings in which there is one node for each common prefix. * Suggest corrections for each unknown word using some kind of _off-by-one_ rule.
ore topic actions
Topic revision: r2 - 27 Nov 2007,
Haskell 98 Report
Profiling with GHC
Growing a Language
FFI in Haskell
Center for ST
Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding UUCS?