Assignment 7
Due: Thursday, April 18, by 5:00pm
You can turn in your assignment up until 5:00pm on 4/18/19. You should
hand in both a hardcopy and electronic copy of your solutions. Your hardcopy
submission should include printouts of 6 code files:
reduceImage.m, testReduce.m, getLanguageInfo.m, findLanguage.m, chiSquare.m,
and testSentences.m. To save paper, you can cut and paste all of your code
files into one script, but your electronic submission should contain the separate files.
Your electronic submission is described in the section How to turn in this
assignment.
Reading
The following material in the Gilat text is useful to review for this assignment: 53-55, 103-110. You should also review notes and examples from Lectures #17, 18 and 19, and Lab #9.
Getting Started: Download assign7_exercises and assign7_problem
Use Cyberduck to download a copy of the assign7_exercises and assign7_problem
folders from the download folder. The assign7_exercises folder contains a subfolder
of images for the exercise, and a file named party.txt that you will work with in lab.
The assign7_problem folder contains three files for the problem in this assignment:
testSentences.m, frequencyInfo.txt, and sentences.txt.
Uploading your saved work
When you are done with this assignment, your assign7_exercises folder should
contain a new subfolder named thumbs with a set of thumbnail images created
in the exercise, and two code files: reduceImage.m and testReduce.m.
Your assign7_problem folder should at least contain the following code files:
getLanguageInfo.m, findLanguage.m, chiSquare.m, and testSentences.m.
Use Cyberduck to connect to the CS file server and navigate
to your cs112/drop/assign07 folder. Drag your assign7_exercises and assign7_problem
folders to this drop folder. More details about this process
can be found on the webpage on Managing Assignment Work.
Exercise: Creating Thumbnails for a Folder of Images
This exercise uses a set of MATLAB functions related to managing files and directories
that you will learn about in lab: pwd, filesep, dir, and mkdir.
You can also explore these functions in the MATLAB Help system. Before starting, set
the Current Directory to the assign7_exercises folder.
Write a function named reduceImage that has two inputs: (1) a matrix that
stores an image and (2) a scale factor. This function should have a single output that is a
new image matrix. The output image should be a reduced version of the input image,
created by sampling the rows and columns at regular intervals given by the input scale
factor. For example, given the following function call:
smallMatrix = reduceImage(originalMatrix, 4);
the resulting image should be one fourth the height and width of the original image,
created by sampling every 4th row and column. The reduceImage function can be very short!
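As an illustration of the sampling idea only, the following sketch applies strided indexing to a small made-up matrix. The 8-by-12 RGB test matrix is hypothetical, and the indexing expression is one possible approach rather than the required solution.

```matlab
% Hypothetical 8x12 RGB test image filled with random pixel values.
originalMatrix = uint8(randi([0 255], 8, 12, 3));
scale = 4;

% Keep every 4th row and every 4th column; the trailing colon keeps all
% color planes (it also works for a 2-D grayscale matrix).
smallMatrix = originalMatrix(1:scale:end, 1:scale:end, :);

disp(size(smallMatrix));   % rows 1 and 5, columns 1, 5 and 9 are kept
```

Inside a function file, the same indexing expression would operate on the two function inputs instead of the demo variables.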
Write a script named testReduce.m that tests your reduceImage
function for a set of images stored in the images subfolder that is
contained inside the assign7_exercises folder. This script should use the
dir function to create a listing of all of the files stored in the images
subfolder that have a filename extension of .jpg. It
should then create a new folder named thumbs in the assign7_exercises
folder. Finally, it should loop through all the image files and perform the following
three steps for each one:
- read the image from the images folder into the MATLAB workspace,
- reduce the image by a factor of 4, and
- store the reduced image in the thumbs folder (using the imwrite function).
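The steps above might be organized roughly as follows. This is a sketch that assumes the Current Directory is assign7_exercises; the variable names are illustrative, and the inline indexing stands in for a call to your own reduceImage function.

```matlab
% Assumption: the Current Directory is the assign7_exercises folder.
imageFolder = [pwd filesep 'images'];
thumbFolder = [pwd filesep 'thumbs'];

% dir returns a struct array with one element per matching file.
listing = dir([imageFolder filesep '*.jpg']);

% Create the thumbs folder only if it does not exist yet.
if ~exist(thumbFolder, 'dir')
    mkdir(thumbFolder);
end

for k = 1:numel(listing)
    img = imread([imageFolder filesep listing(k).name]);
    thumb = img(1:4:end, 1:4:end, :);   % stand-in for reduceImage(img, 4)
    imwrite(thumb, [thumbFolder filesep listing(k).name]);
end
```

Building paths with filesep keeps the script independent of the operating system's path separator.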
Problem: Parlez-vous Francais?
The frequency of occurrence of different letters of the alphabet varies across languages and can be used to identify the language in which a particular selection of text is written. The following tables show some data on the frequency of occurrence (listed as percentages) of the nine most common letters in six languages:
English | German | Finnish | French | Italian | Spanish
In this problem, you will complete a program that uses the above data to identify
the language in which a sentence is written. The assign7_problem folder contains a
file frequencyInfo.txt with the frequency information in
the above table for the six languages. It also contains a script file testSentences.m
with initial testing code for your program. This script first constructs six strings of letters that
are not real sentences, but just contain the right proportion of the most common letters for each
language. These strings are for testing whether your code is working ok. The script calls
the findLanguage function that you will write, with each of these test sentences as
input. To complete this program, you will first write three functions named getLanguageInfo,
findLanguage and chiSquare, described below. You will then add code to the
testSentences.m script to identify the language for a set of real test sentences
stored in the sentences.txt text file.
getLanguageInfo.m
The getLanguageInfo function should have no inputs and two outputs. It should
read the information stored in the frequencyInfo.txt file and return two cell
arrays. The first cell array should contain the names of the six languages, obtained from the
first line of the text file (this should be a single cell array of strings, and not a cell
array that is embedded in another cell array). The second cell array should contain the most
common letters and their associated frequencies of occurrence for the six languages,
obtained from the remaining contents of the text file. The built-in textscan function
can be used with a format string that contains 12 format specifiers (%s or %f),
to load the remaining contents of the text file into a cell array that has 12 elements that alternate
between cell arrays of frequent letters and vectors of frequencies for the six languages.
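The reading pattern can be tried out on a tiny stand-in file before using the real data. The sketch below assumes a layout of whitespace-separated names on the first line followed by rows of letter/frequency pairs; the two-language demo file, its name, and its values are hypothetical placeholders, not the contents of frequencyInfo.txt.

```matlab
% Write a hypothetical two-language file in the assumed layout.
fname = fullfile(tempdir, 'demoFrequencyInfo.txt');
fid = fopen(fname, 'w');
fprintf(fid, 'LangA LangB\n');
fprintf(fid, 'e 12.5 a 11.0\n');
fprintf(fid, 't 9.0 e 10.5\n');
fclose(fid);

fid = fopen(fname, 'r');
% First line: split on whitespace into a flat cell array of names.
names = strsplit(strtrim(fgetl(fid)));
% One '%s %f' pair per language (this gives 12 specifiers for 6 languages).
formatString = strtrim(repmat('%s %f ', 1, numel(names)));
% textscan continues from the current file position, after the first line.
info = textscan(fid, formatString);
fclose(fid);
```

Here info{1} holds the letters for the first language and info{2} the matching frequencies, and the pattern repeats for each further language.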
findLanguage.m
The findLanguage function should have a single input that is a string of
words in a particular language, and should return two values: (1) the name of the language,
and (2) the chi-squared statistic described below, which captures how well the sentence
fits the pattern expected for the chosen language. This function should
call your getLanguageInfo function to get the language names and frequency data
for the six languages. For each language, the findLanguage function should first count
the number of occurrences in the input string of the 9 most common letters for the language.
From these counts, you can then determine the
frequency of occurrence of each of these 9 letters in the input string. The observed
frequencies of occurrence can be compared to the expected frequencies for each language, to
determine how well the input sentence fits the expected data for the language. This last step
can be accomplished by calculating the $\chi^2$ (Chi-Squared) statistic between the
observed and expected frequencies, described in the
next section. The most likely language for the input sentence is the one with the smallest value
for the $\chi^2$ statistic.
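Two pieces of this description, counting letters and picking the smallest statistic, can be sketched in isolation. The three-letter list, the language names, and the statistic values below are hypothetical, and computing percentages relative to all letters in the sentence is an assumption about how the observed frequencies should be normalized.

```matlab
sentence = 'this great english sentence is too long and trite';
letters = {'e'; 't'; 'a'};   % hypothetical short list; the assignment uses 9 per language

% Count occurrences of each letter with a logical comparison.
counts = zeros(numel(letters), 1);
for k = 1:numel(letters)
    counts(k) = sum(lower(sentence) == letters{k});
end

% Assumption: frequencies are percentages of all letters in the sentence.
observed = 100 * counts / sum(isletter(sentence));

% Selecting the language: min returns both the smallest value and its index.
names = {'LangA', 'LangB', 'LangC'};   % hypothetical names
chiValues = [21.7 10.6 36.3];          % hypothetical statistics
[bestChi, bestIndex] = min(chiValues);
language = names{bestIndex};
```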
chiSquare.m
The chiSquare function should have two inputs corresponding to the observed
and expected frequencies for the 9 most common letters in a language, and should return the
value of the $\chi^2$ (Chi-Squared) statistic. To calculate this statistic,
suppose you are given a vector E of the expected frequencies of occurrence of
particular events (in this case, the appearance of certain letters in a text string) and a
second vector O that contains the observed frequencies of occurrence.
The $\chi^2$ statistic captures the difference between O and
E, and is measured as follows:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where the sum is taken over the set of frequencies.
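As a quick check of the formula, it can be evaluated on small made-up vectors using element-wise operators. The values below are hypothetical and chosen so the result is easy to verify by hand: (10-8)^2/8 + (20-25)^2/25 + (30-30)^2/30 = 0.5 + 1 + 0 = 1.5.

```matlab
% Hypothetical observed (O) and expected (E) frequencies for three letters.
observed = [10 20 30];
expected = [8 25 30];

% Element-wise square and divide, then sum over all entries.
chiValue = sum((observed - expected).^2 ./ expected);

disp(chiValue);   % 1.5 for these values
```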
testSentences.m
When you have completed the definitions of your getLanguageInfo, findLanguage
and chiSquare functions, run the testing code in the testSentences.m
script to identify the languages for the test strings defined there. The findLanguage
function has two outputs, but in the testing code that is provided, the call to the
findLanguage function is embedded directly in a disp statement that
only uses the first output of this function, e.g.:
disp(['sentence: englishTest language: ' findLanguage(englishTest)]);
The sentences.txt file contains a real sentence from each of the 6
languages. Add code to the testSentences.m script that first reads the 6 sentences
from the file, creating a cell array in which each sentence is stored as a separate string.
Call the findLanguage function on each sentence, with output variables to store
both the language and chi-squared statistic, and write the results into a new text file named
results.txt. When complete, the new text file should have the following content
and format (the numbers are the chi-squared values):
Spanish 19.19 entre broma y broma, la verdad se asoma
German 36.33 viele koche verderben den brei
Finnish 21.68 ihmetapauksiin voi toivoa mutta ala luota niihin
French 13.03 il faut battre le fer pendant qu'il est chaud
English 10.55 this great english sentence is too long and trite
Italian 10.26 con piacere colgo l'occasione di ringraziare tutti gli amici italiani che con i loro consigli
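The file handling for this step might look roughly like the sketch below, which reads lines into a cell array with fgetl and writes formatted lines with fprintf. The demo file names in the temporary folder are hypothetical, and the language names and statistics are copied from the expected output above as placeholders; in the real script they would come from a call such as [language, chiValue] = findLanguage(sentence).

```matlab
% Hypothetical stand-in for sentences.txt, written to a temporary folder.
inName = fullfile(tempdir, 'sentences_demo.txt');
fid = fopen(inName, 'w');
fprintf(fid, '%s\n', 'this great english sentence is too long and trite');
fprintf(fid, '%s\n', 'il faut battre le fer pendant qu''il est chaud');
fclose(fid);

% Read the file line by line; fgetl returns -1 (not a string) at end of file.
sentences = {};
fid = fopen(inName, 'r');
textLine = fgetl(fid);
while ischar(textLine)
    sentences{end+1} = textLine;
    textLine = fgetl(fid);
end
fclose(fid);

% Placeholder results; the real values come from findLanguage.
languages = {'English', 'French'};
chiValues = [10.55 13.03];

% Write one line per sentence: language, statistic (2 decimals), sentence.
outName = fullfile(tempdir, 'results_demo.txt');
fid = fopen(outName, 'w');
for k = 1:numel(sentences)
    fprintf(fid, '%s %.2f %s\n', languages{k}, chiValues(k), sentences{k});
end
fclose(fid);
type(outName);
```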
How to turn in this assignment
Step 1. Complete this online form.
The form asks you to estimate your time spent on the problems. We use this information to help us design
assignments for future versions of CS112. Completing the form is a requirement of submitting the assignment.
Step 2. Upload your final programs to the CS server.
When you have completed all of the work for this assignment, your assign7_exercises
folder should contain a new subfolder named thumbs with a set of thumbnail images created
in the exercise, and two code files: reduceImage.m and testReduce.m.
Your assign7_problem folder should at least contain the following code files:
getLanguageInfo.m, findLanguage.m, chiSquare.m, and testSentences.m.
Use Cyberduck to connect
to your personal account on the server and navigate to your cs112/drop/assign07 folder.
Drag your assign7_exercises and assign7_problem folders to this drop
folder. More details about this process can be found on the webpage on
Managing Assignment Work.
Step 3. Hardcopy submission.
Your hardcopy submission should include printouts of 6 code files:
reduceImage.m, testReduce.m, getLanguageInfo.m, findLanguage.m, chiSquare.m,
and testSentences.m.
To save paper, you can cut and paste your code files into one file, and you only need to
submit one hardcopy for you and your partner. If you cannot submit your hardcopy in class on the due
date, please slide it under Ellen's office door.