CS230 Lab: Hash tables

Lab: Hash tables

Practice with hash tables: writing a spell checker
 

Solutions

 

Writing your own spell checker

In this lab we will implement a program to do spell checking of text documents against a dictionary of words. Sume peepel reelie nead a spel czequer; sume peepel du knop.

[START PRELAB TASK 0]

 

Task 0

Let's make a mini-hash table! Here's Java's Hashtable API.

 import java.util.*;

 Hashtable<String, String> dates = new Hashtable<String, String>(10); // 10 initial spots
 // populate the table
 dates.put("Commencement", "6/01/2018");
 dates.put("Last Day of Classes", "5/11/2018");
 dates.put("Marathon Monday", "4/16/2018");

What happens if you execute the following lines of code?

  • System.out.println(dates);

      // in what order will the dates print?
    
    
  • String str = dates.get("Commencement");

    
    
  • System.out.println((dates.containsKey("Last Day of Classes")));

    
    
  • System.out.println((dates.containsKey("Sam's birthday")));

    
    
  • for (int i=0; i<=20; i++ ) {
       dates.put("Important Day"+i, "4/"+ i +"/2018");
    }

    // will there be enough room in the hash table?
    
    

A hash table is a very good data structure to hold a dictionary of words against which documents will be checked, because it provides for an efficient way to look elements up in it. (How efficient?)

Once the dictionary words are loaded into a hash table, the program will be reading consecutive words from the document to be checked. Each word will be looked up in the dictionary. If a word is not found in the dictionary, the word is considered a spelling error.

The dictionary will be read from a text file (on disk) into a hash table (in memory). Each word in the file serves as a key as we insert entries into the hash table. In this application, the object associated with the key is of no importance to us, so in your implementation it can be whatever you choose, but null. For example, it can be the same object as the key itself, or a number, or anything else.

Our program will provide the user with the flexibility to choose the file of words they want to use to build their dictionary. In addition, we will structure this application in such a way that it is easy to replace the hash table holding the dictionary with some other suitable data structure in the future.

In particular, we will separate the task of creating the dictionary from the task of spell checking the document. Implement the first task in a class named EnglishHashDictionary.java and the second task in a class named Speller.java. The Speller class should require minimal changes when a dictionary other than the EnglishHashDictionary is used.

The folder you downloaded provides a starting point for your work. This folder contains the following files:

  1. Dictionary.java
  2. EnglishWords.txt
  3. EnglishWords1.txt
  4. EnglishWords2.txt
  5. RomeoAndJuliet.txt, GreenEggs.txt (and other text files)

In particular:

Any class that defines a dictionary should implement this interface. In particular, any such class should provide a way to add a word to the dictionary, search the dictionary for a given word, and remove a word from it. Our EnglishHashDictionary class will, indeed, be no exception. Using this interface, the client code, like the Speller in our application, (see further down) becomes independent of the specifics in such implementation.

  • The file EnglishWords.txt contains about 25,000 words that make up the dictionary we will use today (EnglishWords1.txtcontains about 40,000 words;EnglishWords2.txt contains about 80,000 words -- you can use any of these files, or try them all and see if you notice a speed difference).

    Notice that both files contain some, but not all variations of a given word (e.g. plural, -ing or -ed forms of the words). Anything not in our dictionary file is considered a misspelling. This is clearly not ideal, but this is a preliminary spell-check program (think version 1.0).

  • The file EnglishHashDictionary.java will provide the definition of a dictionary based on a Hash Table.

  • The RomeoAndJuliet.txt and GreenEggs.txt files contain English text to be spell checked. Of course, feel free to create your own text files for testing purposes!

[END PRELAB]

 

Task 1

Can you think of any other ADTs suitable for holding the dictionary of words in this application? What are some advantages and disadvantages of each one?

 

Task 2a

Implement the class EnglishHashDictionary.

In particular, the EnglishHashDictionary.java should implement the Dictionary Interface, specified in the Dictionary.java file. This program will read a text file containing the dictionary (a word per line) into a Hashtable object. Use Java's Hashtable implementation in this assignment.
Add a main method to your file to test your code, and make sure it works well, before you proceed to the next step.

Here is some sample output from the main method in EnglishHashDictionary:

main()
public static void main(String[] args) {
 EnglishHashDictionary ehd = new EnglishHashDictionary("EnglishWords.txt");
 System.out.println("donut: " + ehd.searchWord("donut"));
 System.out.println("Adding donut to dictionary");
 ehd.addWord("donut");
 System.out.println("donut: " + ehd.searchWord("donut"));
 System.out.println("data: " + ehd.searchWord("data"));
 System.out.println("booger: " + ehd.searchWord("booger"));
 System.out.println("  ");
 }
output
*************************
Dictionary EnglishWords.txt loaded containing 45427 words
*************************

donut: false
Adding donut to dictionary
donut: true
data: true
booger: false

 

Task 2b

Implement the Speller class. In particular, provide a method, checkSpelling(), which takes an argument: the name of a text file to be spell checked. As spell-checking proceeds, any misspelled words should be added to some kind of container, and printed on the standard output at the end.

As before, write a main() method, that creates a dictionary and a speller, and spell check a few text documents (there are a couple provided for you, even better, make up or use your own documents). Some of the documents available: Royals.txt, Roar.txt, GreenEggs.txt, sonnet116.txt, AllYouNeedIsLove.txt, DeclarationIndependence.txt, YouAreTheMusicInMe.txt. (Note that our spell checker has a hard time handling punctuation. Many of our sample documents highlight flaws in our spell checker, just food for thought for future versions of your spell checker...)

Arrange your program so the user can provide the name of the file to be checked as a command-line argument (as shown in the examples below). Also, if they wish so, they should be able to provide the file containing the dictionary words as a second command line input. If they do not wish so, the program should use a default such file.