CS 112

Assignment 7

Due: Friday, April 17 at 10:00am

You can turn in your assignment up until 10:00am on 4/17/15. You should hand in both a hardcopy and an electronic copy of your solutions (it is ok to hand in your hardcopy during lab on Friday, but the electronic copy should be submitted by 10:00am). Your hardcopy submission should include printouts of 6 code files: reduceImage.m, testReduce.m, getLanguageInfo.m, findLanguage.m, chiSquare.m, and testSentences.m. Your electronic submission is described in the section Uploading your saved work. (If you'd like to save paper, you can cut and paste all of your code files into one script, but your electronic submission should contain the separate files.)

Reading

The following material in the Gilat text is useful to review for this assignment: 53-55, 103-110. You should also review notes and examples from Lectures #17, 18 and 19, and Lab #9.

Getting Started: Download assign7_exercises and assign7_problem

Use Fetch or WinSCP to connect to the CS server using the cs112d account and download a copy of the assign7_exercises and assign7_problem folders from the cs112d directory onto your Desktop.

The assign7_exercises folder contains a subfolder of images for the exercise, and a file named party.txt that you will work with in lab. The assign7_problem folder contains three files for the problem in this assignment, testSentences.m, frequencyInfo.txt, and sentences.txt.

Uploading your saved work

When you are done with this assignment, your assign7_exercises folder should contain a new subfolder named thumbs with a set of thumbnail images created in the exercise, and two code files: reduceImage.m and testReduce.m. Your assign7_problem folder should at least contain the following code files: getLanguageInfo.m, findLanguage.m, chiSquare.m, and testSentences.m.

Use Fetch or WinSCP to connect to your personal account on the CS file server and navigate to your cs112/drop/assign7 folder. Drag your assign7_exercises and assign7_problem folders to this drop folder. More details about this process can be found on the webpage on Managing Assignment Work.

Exercise: Creating Thumbnails for a Folder of Images

This exercise uses a set of MATLAB functions related to managing files and directories that you will learn about in lab: pwd, filesep, dir, and mkdir. You can also explore these functions in the MATLAB Help system. Before starting, set the Current Directory to the assign7_exercises folder.

Write a function named reduceImage that has two inputs: (1) a matrix that stores an image and (2) a scale factor. This function should have a single output that is a new image matrix. The output image should be a reduced version of the input image, created by sampling the rows and columns at regular intervals given by the input scale factor. For example, given the following function call:

smallMatrix = reduceImage(originalMatrix, 4);

the resulting image should have one fourth as many rows and one fourth as many columns as the original image, created by sampling every 4th row and column of it. The reduceImage function can be very short!
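As a small illustration of sampling at regular intervals, the sketch below applies colon indexing with a step to a toy 8x8 matrix rather than to a real image. The variable names A, B and step are hypothetical, and starting the sampling at row and column 1 is an assumption (the assignment does not say which row to start from). A color image has a third dimension, which would need an extra : index.

```matlab
% Toy 8x8 matrix standing in for a grayscale image (hypothetical data)
A = reshape(1:64, 8, 8);

% Assumed scale factor, matching the example call above
step = 4;

% Keep rows 1, 5 and columns 1, 5 (every 4th row and column, starting at 1)
B = A(1:step:end, 1:step:end);

disp(size(B));   % number of rows and columns after sampling
disp(B);
```

Here the 8x8 input becomes a 2x2 output, so each dimension shrinks by the scale factor.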

Write a script named testReduce.m that tests your reduceImage function for a set of images stored in the images subfolder that is contained inside the assign7_exercises folder. This script should use the dir function to create a listing of all of the files stored in the images subfolder that have a filename extension of .jpg. It should then create a new folder named thumbs in the assign7_exercises folder. Finally, it should loop through all the image files and perform the following three steps for each one:

  1. read the image from the images folder into the MATLAB workspace,
  2. reduce the image by a factor of 4, and
  3. store the reduced image in the thumbs folder (using the imwrite function).
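The general pattern of listing files, creating a folder, and looping over the listing can be sketched as follows. To stay self-contained, this sketch works in a hypothetical temporary folder (demoDir) with two randomly generated images, and it writes unmodified copies instead of reduced images. It also builds paths with fullfile, which is one alternative to joining names with filesep by hand. None of these names come from the assignment.

```matlab
% Hypothetical demo folder under the system temp directory, filled with
% two small random grayscale images so the sketch can run anywhere.
demoDir = tempname;
mkdir(demoDir);
imwrite(uint8(randi(255, 16, 16)), fullfile(demoDir, 'a.jpg'));
imwrite(uint8(randi(255, 16, 16)), fullfile(demoDir, 'b.jpg'));

% dir returns a struct array with one element per matching file
listing = dir(fullfile(demoDir, '*.jpg'));

% Create an output folder (hypothetical name) inside the demo folder
outDir = fullfile(demoDir, 'out');
mkdir(outDir);

for k = 1:length(listing)
    name = listing(k).name;                  % file name without its path
    im = imread(fullfile(demoDir, name));    % read one image
    imwrite(im, fullfile(outDir, name));     % write a copy under the same name
end
```

In your own script the image would be passed through reduceImage between the read and write steps.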

Problem: Parlez-vous Francais?

The frequency of occurrence of different letters of the alphabet varies across languages and can be used to identify the language in which a particular selection of text is written. The following table shows some data on the frequency of occurrence (listed as percentages) of the nine most common letters in six languages:

  English    German     Finnish    French     Italian    Spanish
  e 12.31    e 18.46    a 12.06    e 15.87    e 11.79    e 13.15
  t  9.59    n 11.42    i 10.59    a  9.42    a 11.74    a 12.69
  a  8.05    i  8.02    t  9.76    i  8.41    i 11.28    o  9.49
  o  7.94    r  7.14    n  8.64    s  7.90    o  9.83    s  7.60
  n  7.19    s  7.04    e  8.11    t  7.26    n  6.88    n  6.95
  i  7.18    a  5.38    s  7.83    n  7.15    l  6.51    r  6.25
  s  6.59    t  5.22    l  5.86    r  6.46    r  6.37    i  6.25
  r  6.03    u  5.01    o  5.54    u  6.24    t  5.62    l  5.94
  h  5.14    d  4.94    k  5.20    l  5.34    s  4.98    d  5.58

In this problem, you will complete a program that uses the above data to identify the language in which a sentence is written. The assign7_problem folder contains a file frequencyInfo.txt with the frequency information in the above table for the six languages. It also contains a script file testSentences.m with initial testing code for your program. This script first constructs six strings of letters that are not real sentences, but just contain the right proportion of the most common letters for each language. These strings are for testing whether your code is working ok. The script calls the findLanguage function that you will write, with each of these test sentences as input. To complete this program, you will first write three functions named getLanguageInfo, findLanguage and chiSquare, described below. You will then add code to the testSentences.m script to identify the language for a set of real test sentences stored in the sentences.txt text file.

getLanguageInfo.m

The getLanguageInfo function should have no inputs and two outputs. It should read the information stored in the frequencyInfo.txt file and return two cell arrays. The first cell array should contain the names of the six languages, obtained from the first line of the text file (this should be a single cell array of strings, and not a cell array that is embedded in another cell array). The second cell array should contain the most common letters and their associated frequencies of occurrence for the six languages, obtained from the remaining contents of the text file. The built-in textscan function can be used with a format string that contains 12 format specifiers (%s or %f), to load the remaining contents of the text file into a cell array that has 12 elements that alternate between cell arrays of frequent letters and vectors of frequencies for the six languages.
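The sketch below shows how textscan behaves with alternating %s and %f specifiers, using a small hypothetical two-language file written to a temporary location. The exact layout of frequencyInfo.txt is not shown here, so the file contents, the file name fname, and the use of fgetl for the header line are all assumptions. With six languages the format string would have 12 specifiers instead of 4.

```matlab
% Write a small hypothetical two-language file (this layout is an assumption)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'English German\n');
fprintf(fid, 'e 12.31 e 18.46\n');
fprintf(fid, 't 9.59 n 11.42\n');
fclose(fid);

fid = fopen(fname, 'r');
firstLine = fgetl(fid);                 % header line with the language names
wrapped = textscan(firstLine, '%s');    % 1x1 cell that holds a cell array
names = wrapped{1};                     % unwrap to a plain cell array of strings
data = textscan(fid, '%s %f %s %f');    % continues from the current file position
fclose(fid);

% data alternates between cell arrays of letters and vectors of numbers
disp(names');
disp(data{1}');
disp(data{2}');
```

Note the unwrapping step: textscan always returns a cell array with one element per format specifier, so a single %s yields a cell array nested inside another.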

findLanguage.m

The findLanguage function should have a single input that is a string of words in a particular language, and should return two values: (1) the name of the language, and (2) the chi-squared statistic described below, which captures how well the sentence fits the pattern expected for the chosen language. This function should call your getLanguageInfo function to get the language names and frequency data for the six languages. For each language, the findLanguage function should first count the number of occurrences of each of the language's 9 most common letters in the input string. From these counts, you can then determine the frequency of occurrence of each of these 9 letters in the input string. The observed frequencies of occurrence can be compared to the expected frequencies for each language, to determine how well the input sentence fits the expected data for the language. This last step can be accomplished by calculating the χ² (chi-squared) statistic between the observed and expected frequencies, described in the next section. The most likely language for the input sentence is the one with the smallest value of the χ² statistic.
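The counting step can be illustrated on a short string. This sketch counts three hypothetical letters rather than a language's nine, and it expresses the observed frequency as a percentage of the alphabetic characters in the string. That choice of denominator is an assumption, since the text above does not specify it. The names sentence, letters, counts and observed are illustrative only.

```matlab
sentence = 'this is a test';     % hypothetical input string
letters = 'tse';                 % hypothetical subset of letters to count

% Count how many times each letter appears in the string
counts = zeros(1, length(letters));
for k = 1:length(letters)
    counts(k) = sum(sentence == letters(k));
end

% Assumption: frequencies are percentages of the alphabetic characters,
% so spaces and punctuation are excluded from the total
numLetters = sum(isletter(sentence));
observed = 100 * counts / numLetters;

disp(counts);
disp(observed);
```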

chiSquare.m

The chiSquare function should have two inputs corresponding to the observed and expected frequencies for the 9 most common letters in a language, and should return the value of the χ² (chi-squared) statistic. To calculate this statistic, suppose you are given a vector E of the expected frequencies of occurrence of particular events (in this case, the appearance of certain letters in a text string) and a second vector O that contains the observed frequencies of occurrence. The χ² statistic captures the difference between O and E, and is measured as follows:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where the sum is taken over the set of frequencies.
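As a worked example of the formula, take two hypothetical observed values O = [10 20] and expected values E = [12 18]. The terms are (10-12)²/12 = 1/3 and (20-18)²/18 = 2/9, which sum to 5/9. The sketch below evaluates the same sum with element-wise operators. The numbers are made up for illustration and are not taken from the frequency table.

```matlab
O = [10 20];    % hypothetical observed frequencies
E = [12 18];    % hypothetical expected frequencies

% Element-wise square and divide, then add up the terms
chi = sum((O - E).^2 ./ E);

disp(chi);
```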

testSentences.m

When you have completed the definitions of your getLanguageInfo, findLanguage and chiSquare functions, run the testing code in the testSentences.m script to identify the languages for the test strings defined there. The findLanguage function has two outputs, but in the testing code that is provided, the call to the findLanguage function is embedded directly in a disp statement that only uses the first output of this function, e.g.:

disp(['sentence: englishTest   language: ' findLanguage(englishTest)]);
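The difference between using only the first output and capturing both can be seen with any function that returns two values. The sketch below uses the built-in min function as a stand-in for findLanguage, applied to a hypothetical vector named scores.

```matlab
scores = [19.19 36.33 10.55];    % hypothetical values

% Embedded in an expression, only the first output (the minimum) is used
disp(['smallest: ' num2str(min(scores))]);

% With two output variables, the second output (the index) is captured too
[smallest, idx] = min(scores);
disp(['smallest: ' num2str(smallest) '   index: ' num2str(idx)]);
```

The same square-bracket syntax on the left-hand side applies to any function you write with two outputs.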

The file sentences.txt contains a real sentence from each of the 6 languages. Add code to the testSentences.m script that first reads the 6 sentences from the file, creating a cell array in which each sentence is stored as a separate string. Call the findLanguage function on each sentence, with output variables to store both the language and chi-squared statistic, and write the results into a new text file named results.txt. When complete, the new text file should have the following content and format (the numbers are the chi-squared values):

Spanish   19.19   entre broma y broma, la verdad se asoma 
German    36.33   viele koche verderben den brei 
Finnish   21.68   ihmetapauksiin voi toivoa mutta ala luota niihin 
French    13.03   il faut battre le fer pendant qu'il est chaud 
English   10.55   this great english sentence is too long and trite 
Italian   10.26   con piacere colgo l'occasione di ringraziare tutti gli amici italiani che con i loro consigli
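One possible way to read a text file line by line and write aligned columns is sketched below. It uses hypothetical temporary files and made-up labels and values instead of sentences.txt and real results. The column widths in the fprintf format string (10 and 8 characters) are inferred from the sample output above and are an assumption, as is the use of an fgetl loop for reading.

```matlab
% Create a small hypothetical input file with two lines of text
inName = [tempname '.txt'];
fid = fopen(inName, 'w');
fprintf(fid, 'first line of text\nsecond line of text\n');
fclose(fid);

% Read every line into its own cell; fgetl returns -1 at end of file
fid = fopen(inName, 'r');
lines = {};
textLine = fgetl(fid);
while ischar(textLine)
    lines{end+1} = textLine;
    textLine = fgetl(fid);
end
fclose(fid);

% Made-up labels and numbers standing in for real results
labels = {'LabelA', 'LabelB'};
values = [1.5 22.25];

% Write left-justified columns: a 10-wide string, an 8-wide number with
% two decimals, then the line of text
outName = [tempname '.txt'];
fid = fopen(outName, 'w');
for k = 1:length(lines)
    fprintf(fid, '%-10s%-8.2f%s\n', labels{k}, values(k), lines{k});
end
fclose(fid);

type(outName);    % display the file that was written
```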