CS 112

Assignment 5

Due: Friday, November 3, at the beginning of class

You can turn in your assignment up until 5:00pm on 11/3/06 without penalty, but it is best to hand in the assignment at the beginning of class. Your hardcopy submission should include a cover sheet and printouts of 7 code files: findPeaksAndValleys.m, anagram.m, translatecodon.m, rna2amino.m, translateRNA.m, chiSquare.m, and whichLanguage.m. The cover page includes the number of points that each exercise and problem is worth, written in parentheses in the table. Your electronic submission is described in the section Uploading your saved work

Reading

The following new material in the text is especially useful to review for this assignment: Sections 8.3, 8.4, 8.8 and pp. 56-57, 100-101, and Sect. 4.3 (omit the section on Structures). You should also review notes and examples from Lectures #12-14, and Lab #7.

Getting Started: Download assign5_programs from cs112d

Use Fetch or WinSCP to connect to the CS server using the cs112d account and download a copy of the assign5_programs folder from the cs112d directory onto your Desktop. Rename the folder to be yours, e.g. sohie_assign5_programs. In MATLAB, set the Current Directory to your assign5_programs folder.

Uploading your saved work

Use Fetch or WinSCP again to upload your saved work, but this time you should connect to the CS file server using your personal user account name and password. After logging in to your account:

When you are done with this assignment, you should have 7 code files stored in your assign5_programs folder: findPeaksAndValleys.m, anagram.m, translatecodon.m, rna2amino.m, translateRNA.m, chiSquare.m, and whichLanguage.m.

Exercise 1: Peaks and valleys

Finding local minima and local maxima often comes in handy in different fields (financial markets, brain wave activity and a topographic map are shown below).
Here is a plot that illustrates both a local minimum and global minimum as well as a local maximum and global maximum for a particular mathematical function.

You will write a MATLAB function called findPeaksandValleys that will take one input parameter. The input will be a 2D matrix of data (that we will provide, see below). Your function will determine where the local minima and maxima are and print out their locations. In this problem, a local minimum is defined as a value in the matrix being smaller than the values above it, below it, to the right and left of it. In order to be a local minimum/maximum, a value must have four neighbors (above, below, left and right). Two examples with the local minimum are shown in green below. Each 3x3 figure represents a subset of a 2D matrix.

  11  
7 5 12
  6  
 
  3  
12 2 5
  4  

A local maximum (each pink cell below), on the other hand, is a value that is greater than its surrounding neighbors.

  10  
2 18 15
  1  
 
  24  
13 31 29
  30  

In your assign5_programs folder, there is a file called setUpPeaksAndValleys.m that sets up and returns the 2D surface matrix. You'll notice that setUpPeaksAndValleys also plots the surface in 3D. You will write findPeaksAndValleys to determine the locations of any local minima and maxima and print out those locations. When you run findPeaksAndValleys, your output should look something like this:


>> 
>> findPeaksAndValleys(setUpPeaksAndValleys)
valley at row 8, col 17
peak at row 12, col 13
peak at row 15, col 22
valley at row 16, col 9
valley at row 17, col 17
peak at row 23, col 15
>> 

Exercise 2: Anagrams

What do these pairs of words and phrases have in common?

elvis   lives
listen   silent
orchestra   carthorse
Angered   Enraged
deductions   discounted
'Astronomers '   'Moon starers'

Note: There is an extra space after Astronomers that is used between the words in the corresponding anagram

For a pair of words or phrases to be an anagram, they must use the exact same letters and spaces, but simply in a different order.

You will write a MATLAB function called anagram.m that will take two words/phrases as input, determine whether the two words or phrases are anagrams, and print an appropriate message. Your anagram function does not need to return a value, but it does need to print a message indicating whether or not the two inputs are anagrams. Your output should also print the two words/phrases that were supplied as inputs to anagram.

Here is some sample MATLAB output:

>> 
>> anagram('elvis','lives')
elvis and lives are anagrams
>> 
>> anagram('listen','silent')
listen and silent are anagrams
>> 
>> anagram('matlab', 'never ending fun')
matlab and never ending fun are not anagrams
>> 
>> anagram('Tom Cruise ','So Im Cuter')
Tom Cruise  and So Im Cuter are anagrams
>> 
>> anagram('The Morse Code', 'Here come dots')
The Morse Code and Here come dots are anagrams
>> 
>> anagram('The Morse Code ', 'Here come dots')
The Morse Code  and Here come dots are not anagrams
>> 
>> anagram('Angered','Enraged')
Angered and Enraged are anagrams
>> 
>> anagram('matlab','matlab')
matlab and matlab are not anagrams
>> 
Some things to note in the above output:

  1. In the Tom Cruise example, there is an extra space after his name, since two spaces are used in 'So Im Cuter'. Also note there is no apostrophe in the Im contraction.
  2. The Morse Code example has two spaces, so we did not need to add an extra. In fact, if we do try to add an extra space, then the two phrases are no longer anagrams (see second Morse Code example above).
  3. anagram.m detects anagrams independent of case (see 'Angered' and 'Enraged')
  4. The message printed by anagram.m prints the original words/phrases as they were supplied to the anagram.m function
  5. Two words/phrases are NOT anagrams if they are identical words/phrases. The definition of anagram includes re-ordering of the letters and spaces. The last example above illustrates how two words that are the same are not anagrams.
  6. Might be helpful to think about how to tell if two words/phrases are not anagrams as well as how to tell if they indeed are.

Problem 1: The mystery of life

A DNA molecule is a sequence of nucleotides. The exact order of the nucleotides determines the code of each gene. DNA is transcribed to RNA (which, like DNA, is a sequence of nucleotides), and then the RNA is translated into a protein. Each set of three contiguous RNA nucleotides codes for a single amino acid. The protein is made of a chain of amino acids hooked together. Here is a link to a site with more background information.

How does RNA specify the amino acid sequence?
There are 4 nucleotides and 20 amino acids. Each amino acid is specified by a particular triplet of nucleotides, called a codon. The four nucleotides are represented by A, C, G or U (standing for adenine, cytosine, guanine and uracil, respectively). The 20 amino acids are abbreviated as Phe, Ser, Gly, etc. There are three codons (UAA, UAG and UGA) that act as signals to terminate translation, and these are called STOP codons.

Given a RNA nucleotide sequence, we can calculate the amino acid sequence of the resulting protein, reading off one codon at a time from the RNA. For example, 'GUCACCUAA' would translate into ValThrStop. The table that translates from a triplet of nucleotides (a codon) to one amino acid is given below (and is on page 178 of the Kaplan text):

first position second position third position
  U C A G  
U Phe Ser Tyr Cys U
U Phe Ser Tyr Cys C
U Leu Ser Stop Stop A
U Leu Ser Stop Trp G
C Leu Pro His Arg U
C Leu Pro His Arg C
C Leu Pro Gln Arg A
C Leu Pro Gln Arg G
A Ile Thr Asn Ser U
A Ile Thr Asn Ser C
A Ile Thr Lys Arg A
A Met Thr Lys Arg G
G Val Ala Asp Gly U
G Val Ala Asp Gly C
G Val Ala Glu Gly A
G Val Ala Glu Gly G

Translating codons

One subproblem of this task is to translate a codon (e.g. 'GUC') into an amino acid (e.g. Val) by using the table above.

This seating chart example is intended to be a guideline for helping you if you choose to translate from RNA nucleotides into amino acids by using indexing (rather than brute force conditionals).

What you need to write

Here is what you need to code for this problem:
  1. A MATLAB function called translatecodon that takes a codon string as input and returns the corresponding amino acid given in the table above.
  2. A MATLAB function called rna2amino that takes a sequence of RNA nucleotides (e.g. 'GUCACCUAA') and a starting point and translates the entire sequence into a sequence of amino acids (e.g. 'ValThrStop'). rna2amino will call translatecodon.
  3. A MATLAB script file called translateRNA.m that will step through a list of sequence samples in a text file called sequences.txt and print the translation of each sequence into amino acids. You will find the file sequences.txt in your assign5_programs folder. The textread function can be used to read the lines of text of this file into a cell array.
Below is some sample MATLAB output from translateRNA:


>> 
>> translateRNA
sequence 1: Val Thr Stop
sequence 2: Ala Leu Cys 
sequence 3: Ile Met Ala Trp Thr StopLys 
sequence 4: Tyr Leu Ser Ile Tyr Leu Ser Ile 
sequence 5: Leu Tyr StopSer Leu StopGln 
sequence 6: Gln Thr Val Glu Arg Ala Leu 
sequence 7: Arg Cys Arg Ala Thr Leu Arg Val Ser 
>> 
>> 

Problem 2: Parlez-vous Francais?

The frequency of occurrence of different letters of the alphabet varies across languages and can be used to identify the language in which a particular selection of text is written. The following tables show some data on the frequency of occurrence (listed as percentages) of the nine most common letters in six languages:

  English  German  Finnish   French  Italian  Spanish
e12.31
t9.59
a8.05
o7.94
n7.19
i7.18
s6.59
r6.03
h5.14
e18.46
n11.42
i8.02
r7.14
s7.04
a5.38
t5.22
u5.01
d4.94
a12.06
i10.59
t9.76
n8.64
e8.11
s7.83
l5.86
o5.54
k5.20
e15.87
a9.42
i8.41
s7.90
t7.26
n7.15
r6.46
u6.24
l5.34
e11.79
a11.74
i11.28
o9.83
n6.88
l6.51
r6.37
t5.62
s4.98
e13.15
a12.69
o9.49
s7.60
n6.95
r6.25
i6.25
l5.94
d5.58

In this problem, you will complete a program that uses the above data to identify the language in which a sentence is written. There are two code files in the assign5_programs folder for this problem. The setUpTables.m script file constructs a cell array named languages that stores the information in the above tables. The languages cell array consists of 6 nested cell arrays that each contain three elements: the name of the language, a vector of the 9 most common letters, and a vector of the expected frequencies of occurrence of these 9 letters. The testSentences.m script file contains testing code for your program. This file first constructs 6 strings of letters that are not real sentences, but just contain the right proportion of the most common letters for each language. These strings are just for testing whether your code is working ok. The file then creates a few real test sentences from different languages. The whichLanguage function, which you will write, is then called with each of the test sentences.

To complete this program, you should write two functions, whichLanguage and chiSquare. The whichLanguage function should have two inputs: a string of text and the cell array of language data that is created by the setUpTables script. It should have a single output that is the name of the most likely language for the input string. For each language, this function should first count the number of occurrences in the input string, of the 9 most common letters for this language. From these counts, you can then determine the frequency of occurrence of each of these 9 letters in the input string. The observed frequencies of occurrence can be compared to the expected frequencies for this language, to determine how well the input sentence fits the expected data for this language. This last step can be accomplished by calculating the Χ2 (Chi-Squared) statistic between the observed and expected frequencies, described below. The most likely language for the input sentence is the one with the smallest value for the Χ2 statistic.

Suppose you are given a vector E of the expected frequencies of occurrence of particular events (in this case, the appearance of certain letters in a text string) and a second vector O that contains the observed frequencies of occurrence. The Χ2 statistic captures the difference between O and E, and is measured as follows:

Χ2 = Σ (Oi - Ei)2/Ei

where the sum is taken over the set of frequencies. Write a separate function chiSquare that has two vector inputs corresponding to the observed and expected frequencies, and returns the value of this statistic.

When you have completed these two functions, you can run the testing code in the testSentences script to identify the languages for the examples.