Exercise Spell Checker
Afp0607
A Simple Spell Checker
Introduction
For the last AFP exercise, you will have to implement a simple spell checker.
The approach we follow is straightforward. We use one file containing a sorted list of
single words as our dictionary. Next, we we determine all words
that occur in the document that we want to spell check, and we report all words
from this document that are not part of the dictionary. One criterion on which
your implementation will be judged is how well it handles large input files (both
the dictionary and the document to spell check). It is your task to monitor the
run-time behavior of your application (time and space behavior) and to optimize
your code. For this, you will have to use the
profiling tools
that GHC offers you.
Resources
To be more precise about what is expected from your implementation, please have a look at
the
Project Gutenberg Wiki, a collection
of free ebooks. A word list (the dictionary) and a book have been selected from this
collection and these two documents will be used to judge your implementation.
- For the word list (dictionary) we use the Moby Word Lists by Grady Ward, and in particular, the file named
SINGLE.TXT. This file contains over 350,000 English words. On some systems, an alternative word list can be found at /usr/share/dict/words.
You are of course free to test your implementation using other text documents.
Requirements
The module (or modules) that you submit should have a main function (of type
IO ()) such that it
can be compiled to an executable. It should be possible to specify the word list and the document to
be checked by supplying parameters. You can use the function
getArgs from
System.Environment to
obtain the command-line arguments, or you could import
System.Console.GetOpt (the fancy way to
process command-line options).
This task description is deliberately imprecise about how the document should be checked. Some suggestions
follow:
- Do not take capital letters into account: instead, convert these letters to lower-case letters.
- The text will typically contain punctuation marks, quotations, etc. These symbols must be dealt with care.
- Ignore numbers that appear in the text.
- Report an unknown word only once by writing a string to the standard output device. Also report the line number of its first occurrence. Subsequent occurrences of the same word have to be ignored.
- If you apply different rules, please put a comment in your source.
There are several alternative ways to represent a dictionary: of course, the representation you choose
will have a major impact on the time and space behavior of your implementation. It is allowed to reuse
existing data structures, but you can also devise your own special purpose data type. Five operations that
must be supported (in both scenarios) are:
- creating an empty dictionary,
- inserting one element in a dictionary,
- removing one element from a dictionary,
- test whether a word is in the dictionary or not, and
- determining the size of the dictionary.
Please write
QuickCheck
properties that describe as precisely as possible how the five operations interoperate. Write one top-level
function that checks all your properties, and monitor the random test data to validate the results.
Submission
Your submission for this exercise should contain the following parts:
- Your Haskell module(s), including the function
main and the QuickCheck properties.
- A brief instruction how to compile your program (which flags, etcetera).
- A Time and Allocation Profiling Report (a .prof file) that clearly shows which parts of your program are most expensive. It should be clear which input files you used.
- A graph that shows the Memory usage of your program (a .ps file). It should be clear which input files you used.
- Do not send me any of the word lists nor the ebooks.
Bonus
What follows are optional features to excel in this exercise.
- After having reported all unknown words, present a summary that mentions the size of the dictionary and the text document, and that reports the number of unknown words found in the text.
- Implement the dictionary using the trie data structure. A trie is a data structure for storing strings in which there is one node for each common prefix.
- Suggest corrections for each unknown word using some kind of off-by-one rule.