CS230 Lab: Hash tables

Lab: Hash tables

Goals:

  • Practice with Hashtables
  • Explore Java API
  • Make use of interfaces
  • Experiment with command-line arguments

Practice with hash tables: writing a spell checker

In this lab we will implement a program to do spell checking of text documents against a dictionary of words. Sume peepel reelie nead a spel czequer; sume peepel du knop.

A hash table is a very good data structure to hold a dictionary of words against which documents will be checked, because it provides for an efficient way to look elements up in it. (How efficient?)

Once the dictionary words are loaded into a hash table, the program will be reading consecutive words from the document to be checked. Each word will be looked up in the dictionary. If a word is not found in the dictionary, the word is considered a spelling error.

The dictionary will be read from a text file (on disk) into a hash table (in memory). Each word in the file serves as a key as we insert entries into the hash table. In this application, the object associated with the key is of no importance to us, so in your implementation it can be whatever you choose, but null. For example, it can be the same object as the key itself, or a number, or anything else.

Our program will provide the user with the flexibility to choose the file of words they want to use to build their dictionary. In addition, we will structure this application in such a way that it is easy to replace the hash table holding the dictionary with some other suitable data structure in the future.

In particular, we will separate the task of creating the dictionary from the task of spell checking the document. Implement the first task in a class named EnglishHashDictionary.java and the second task in a class named Speller.java. The Speller class should require minimal changes when a dictionary other than the EnglishHashDictionary is used.

Download folder lab11. The material there provides a starting point for your work. This folder contains the following files:

  1. Dictionary.java (the complete interface)
  2. EnglishHashDictionary.java (an incomplete implementation of the Dictionary interface)
  3. Speller.java (an incomplete client)
  4. three "EnglishWordsX.txt" files to use in building your English Language Dictionary
  5. several text files (end in .txt) to use for spell checking.

In particular:

  • Dictionary.java: Any class that defines a dictionary should implement this interface. In particular, any such class should provide a way to add a word to the dictionary, search the dictionary for a given word, and remove a word from it. Our EnglishHashDictionary class will, indeed, be no exception. Using this interface, the client code, like the Speller in our application, (see further down) becomes independent of the specifics in such implementation.
  • The file EnglishWords.txt contains about 25,000 words that make up the dictionary we will use today (EnglishWords1.txt contains about 40,000 words; EnglishWords2.txt contains about 80,000 words -- you can use any of these files, or try them all and see if you notice a speed difference).

    Notice that all text files contain some, but not all variations of a given word (e.g. plural, -ing or -ed forms of the words). Anything not in our dictionary file is considered a misspelling. This is clearly not ideal, but this is a preliminary spell-check program (think version 1.0).

  • The RomeoAndJuliet.txt, GreenEggs.txt and the rest .txt files contain English text to be spell checked. Of course, feel free to create your own text files for testing purposes!

Task 0

Look at the Java's Hashtable API. Inside the lab11Project you downloaded, add a class called HashDriver.java, and inside, add a main() method. Create a Hashtable with values and keys of your choice. Now explore some of the functionalities this class offers you:

  • Add a few entries.
  • Print the table.
  • Look for an entry which exists in the table and for anogther one that doesn't exist.
  • Print the results in an informative way.

Task 1: Back to Hashing...

Can you think of any other ADTs suitable for holding the dictionary of words in this application? What are some advantages and disadvantages of each one?

Task 2a: EnglishHashDictionary.java

Finish the implementation of the class EnglishHashDictionary.

In particular, the EnglishHashDictionary.java should implement the Dictionary Interface, specified in the Dictionary.java file. This program will read a text file containing the dictionary (a word per line) into a Hashtable object. Use Java's Hashtable<K,V> in this assignment. Since for this lab we only need to keep track of the keys K (the words in the dictionary), we do not need to worry much about the values V. We can use the words (String) both as keys (K) and as values (V).

Pay attention to the provided main() method to your file to test your code, and make sure it works well, before you proceed to the next step.

main()
public static void main(String[] args) {
 EnglishHashDictionary ehd = new EnglishHashDictionary("EnglishWords.txt");
 System.out.println("******Testing searchWord() *****");
 System.out.println("donut: " + ehd.searchWord("donut"));
 System.out.println("Adding donut to dictionary");
 ehd.addWord("donut");
 System.out.println("donut: " + ehd.searchWord("donut"));
 System.out.println("data: " + ehd.searchWord("data"));
 System.out.println("booger: " + ehd.searchWord("booger"));
 System.out.println("  ");
 }

Below is some sample output from the main() method in EnglishHashDictionary:

output
*************************
Dictionary EnglishWords.txt loaded containing 45427 words
*************************

******Testing searchWord() *****
donut: false
Adding donut to dictionary
donut: true
data: true
booger: false

Task 2b: Speller.java

Finish the implementation of the Speller class.

In particular, provide a method, checkSpelling(), which takes as an argument the name of a text file to be spell-checked. As spell-checking proceeds, any misspelled words should be added to some kind of "container". What kind of container could be suitable? Provide the means of printing the results on the standard output.

Write a main() method, that creates a dictionary and a speller, and spell-check a few text documents.

Note that our spell-checker has a hard time handling punctuation. Many of our provided .txt documents highlight flaws in our spell-checker. There is not much we can do about this in this lab, but we mention it just as food for thought for future versions of your spell checker...

Ideas for improvements:

If you have time, use command-line arguments in your program:

  • arrange so that the user can provide the name of the document to be checked as a command-line argument.
  • provide a version where the user can provide the name of the file containing the dictionary words as a second command line input.
  • What would it take to make the second argument above optional? (That it, if the user does not enter a second command-line argument, the program uses a default one, otherwise it uses what is specified by the user.)
  • Can you provide the way for the user to input multiple files names for spell-checking as command-line arguments?
  • How can you ignore punctuation in the words as they get checked for belonging in the Dictionary or not?