Exercise Spell Checker

Afp0607

A Simple Spell Checker

Introduction

For the last AFP exercise, you will have to implement a simple spell checker. The approach we follow is straightforward. We use one file containing a sorted list of single words as our dictionary. Next, we we determine all words that occur in the document that we want to spell check, and we report all words from this document that are not part of the dictionary. One criterion on which your implementation will be judged is how well it handles large input files (both the dictionary and the document to spell check). It is your task to monitor the run-time behavior of your application (time and space behavior) and to optimize your code. For this, you will have to use the profiling tools that GHC offers you.

Resources

To be more precise about what is expected from your implementation, please have a look at the Project Gutenberg Wiki, a collection of free ebooks. A word list (the dictionary) and a book have been selected from this collection and these two documents will be used to judge your implementation.

  • For the word list (dictionary) we use the Moby Word Lists by Grady Ward, and in particular, the file named SINGLE.TXT. This file contains over 350,000 English words. On some systems, an alternative word list can be found at /usr/share/dict/words.

You are of course free to test your implementation using other text documents.

Requirements

The module (or modules) that you submit should have a main function (of type IO ()) such that it can be compiled to an executable. It should be possible to specify the word list and the document to be checked by supplying parameters. You can use the function getArgs from System.Environment to obtain the command-line arguments, or you could import System.Console.GetOpt (the fancy way to process command-line options).

This task description is deliberately imprecise about how the document should be checked. Some suggestions follow:

  • Do not take capital letters into account: instead, convert these letters to lower-case letters.
  • The text will typically contain punctuation marks, quotations, etc. These symbols must be dealt with care.
  • Ignore numbers that appear in the text.
  • Report an unknown word only once by writing a string to the standard output device. Also report the line number of its first occurrence. Subsequent occurrences of the same word have to be ignored.
  • If you apply different rules, please put a comment in your source.

There are several alternative ways to represent a dictionary: of course, the representation you choose will have a major impact on the time and space behavior of your implementation. It is allowed to reuse existing data structures, but you can also devise your own special purpose data type. Five operations that must be supported (in both scenarios) are:

  • creating an empty dictionary,
  • inserting one element in a dictionary,
  • removing one element from a dictionary,
  • test whether a word is in the dictionary or not, and
  • determining the size of the dictionary.

Please write QuickCheck properties that describe as precisely as possible how the five operations interoperate. Write one top-level function that checks all your properties, and monitor the random test data to validate the results.

Submission

Your submission for this exercise should contain the following parts:

  • Your Haskell module(s), including the function main and the QuickCheck properties.
  • A brief instruction how to compile your program (which flags, etcetera).
  • A Time and Allocation Profiling Report (a .prof file) that clearly shows which parts of your program are most expensive. It should be clear which input files you used.
  • A graph that shows the Memory usage of your program (a .ps file). It should be clear which input files you used.
  • Do not send me any of the word lists nor the ebooks.

Bonus

What follows are optional features to excel in this exercise.

  • After having reported all unknown words, present a summary that mentions the size of the dictionary and the text document, and that reports the number of unknown words found in the text.

  • Implement the dictionary using the trie data structure. A trie is a data structure for storing strings in which there is one node for each common prefix.

  • Suggest corrections for each unknown word using some kind of off-by-one rule.