Alexis Brunet

CUSEC* 2009 IBM Extreme Blue Programming competition.

Program a spellchecker with suggestions for misspelled words, then go have a beer. It's CUSEC Pub Night!
* CUSEC = Canadian University Software Engineering Conference

Enrique and I spent a few minutes exercising our Google-fu and found the following article on IBM's website. http://www.ibm.com/developerworks/java/library/j-jazzy/

Here's what we got from the article:
  1. Soundex returns too many suggestions to be useful by itself.
  2. Levenshtein Distance Metric is too slow to perform on the entire dictionary every time we find a misspelled word.

We decide to hash the entire dictionary with each word's soundex value as the key. Then soundex the misspelled word and retrieve the list of dictionary words which have the same soundex value.

We then calculate the Levenshtein distance on this sub-set of dictionary words and return the ones with the smallest distance from the misspelled word.

We got our dictionary file by doing cat /usr/share/dict/words > dictionary.txt

Clearly this solution isn't perfect, first as the article above points out, Soundex has some limitations. Second it assumes that the misspelled word is mostly correct (small Levenshtein distance).

This is by no means a production level piece of code (seriously, just use a library, Jazzy perhaps ;) but for a programming competition we achieved the goal we had set out for ourselves, which was to code a working solution without using libraries (and that was not a Google scraper!!) in the time allotted.

So we ended up with a little CLI java program that reads a text file and outputs the misspelled words, their positions and some suggestions.

You can browse the source on GitHub.