CS 112

Assignment 7

Due: Thursday, April 22, at the start of class

You can turn in your assignment up until 5:00pm on 4/22/11 without penalty, but it is best to hand in the assignment at the beginning of class. Your hardcopy submission should include a cover sheet and printouts of six code files: translateRNA.m, rna2amino.m, translatecodon.m, getLanguageInfo.m, whichLanguage.m and chiSquare.m (you can combine your printouts into one file to save paper). Your electronic submission is described in the section Uploading your saved work

Reading

The following material in the fourth edition of the Gilat text is useful to review for this assignment: 103-110 (third edition: pages 93-100). You should also review notes and examples from Lectures #16, 17 and 19, and Lab #11.

Getting Started: Download assign7_programs from cs112d

Use Fetch or WinSCP to connect to the CS server using the cs112d account and download a copy of the assign7_programs folder from the cs112d directory onto your Desktop. Rename the folder to be yours, e.g. sohie_assign7_programs. In MATLAB, set the Current Directory to your assign7_programs folder. This folder has two subfolders named Bio and Languages that contain initial files for Problems 1 and 2.

Uploading your saved work

Use Fetch or WinSCP again to upload your saved work, but this time you should connect to the CS file server using your personal user account name and password. After logging in to your account:

When you are done with this assignment, you should have at least the following code files stored in the assign7_programs folder: translateRNA.m, rna2amino.m, translatecodon.m, getLanguageInfo.m, whichLanguage.m and chiSquare.m. The first three code files should be stored in the Bio subfolder and the second three should be stored in the Languages subfolder.

Problem 1: The mystery of life

A DNA molecule is a sequence of nucleotides. The exact order of the nucleotides determines the code of each gene. DNA is transcribed to RNA (which, like DNA, is a sequence of nucleotides), and then the RNA is translated into a protein. Each set of three contiguous RNA nucleotides codes for a single amino acid. The protein is made of a chain of amino acids hooked together. See the nobelprize.org website for more information.

How does RNA specify the amino acid sequence?

There are 4 nucleotides and 20 amino acids. Each amino acid is specified by a particular triplet of nucleotides, called a codon. The four nucleotides are represented by A, C, G and U (standing for adenine, cytosine, guanine and uracil). The 20 amino acids are abbreviated as Phe, Ser, Gly, etc. There are three codons (UAA, UAG and UGA) that act as signals to terminate translation, called Stop codons. Given an RNA nucleotide sequence, we can calculate the amino acid sequence of the resulting protein, reading off one codon at a time from the RNA. For example, the RNA nucleotide sequence GUCACCUAA would translate into Val Thr Stop (GUC translates to Val, CAC translates to Thr and UAA is a Stop codon). A table that translates a triplet of nucleotides (codon) to one amino acid is given below:

first position second position third position
  U C A G  
U Phe Ser Tyr Cys U
U Phe Ser Tyr Cys C
U Leu Ser Stop Stop A
U Leu Ser Stop Trp G
C Leu Pro His Arg U
C Leu Pro His Arg C
C Leu Pro Gln Arg A
C Leu Pro Gln Arg G
A Ile Thr Asn Ser U
A Ile Thr Asn Ser C
A Ile Thr Lys Arg A
A Met Thr Lys Arg G
G Val Ala Asp Gly U
G Val Ala Asp Gly C
G Val Ala Glu Gly A
G Val Ala Glu Gly G

To understand how to use the table, suppose the codon AUG is given. The A in the first position narrows the possible amino acids to those in the four rows highlighted in yellow above. The U in the second position further narrows the possibilities to the leftmost column of these four rows (containing the amino acids Ile and Met). The G in the third position identifies the amino acid as Met.

Creating a program to translate RNA nucleotide sequences

Your task is to complete a program that translates sample nucleotide sequences into amino acid sequences. Two files, createAminoTable.m and sequences.txt, are provided for you in the assign7_programs/Bio folder. You do not need to make any changes to these files. You should add a high-level script named translateRNA and two functions named rna2amino and translatecodon, described below.

translateRNA.m

This high-level script should read in the sample nucleotide sequences stored in the sequences.txt file, translate each sequence into its corresponding amino acids using the rna2amino function, and print the results in a new text file named translations.txt. The sequences.txt file begins with the following two examples:

GUCACCUAA
GCUCUCUGC

The contents of the translations.txt file that your program creates might begin with something like this:

translations of nucleotide sequences:

sequence: GUCACCUAA
translation: Val Thr Stop

sequence: GCUCUCUGC
translation: Ala Leu Cys

rna2amino.m

The rna2amino function should take an input string that is a nucleotide sequence (e.g. 'GCUCUCUGC') and return a string containing the corresponding amino acid sequence (e.g. 'Ala Leu Cys'). You can assume that the first codon begins with the first character of the input string. The translatecodon function can be used to translate each individual codon.

translatecodon.m

The translatecodon function should take an input string that is a single codon (e.g. 'GCU') and return the amino acid corresponding to this codon (e.g. 'Ala'), using the above table. One approach to this subtask uses what we call brute force. The table above gives all the possible combinations of codons mapped to amino acids, and a very loooooong conditional statement could perform that mapping from each of the 64 possible codons to its corresponding amino acid.

Another more compact and elegant approach to translating codons combines indexing with the application of a base 4 number system. We begin by creating one long list of 64 amino acids by concatenating the columns of the above table, in order from left to right. The createAminoTable.m code file stores this ordered list of 64 amino acids in a cell array named aminoNames, as shown below:

aminoNames = {'Phe', 'Phe', 'Leu', 'Leu', 'Leu', 'Leu', 'Leu', 'Leu', ...
              'Ile', 'Ile', 'Ile', 'Met', 'Val', 'Val', 'Val', 'Val', ...
              'Ser', 'Ser', 'Ser', 'Ser', 'Pro', 'Pro', 'Pro', 'Pro', ...
              'Thr', 'Thr', 'Thr', 'Thr', 'Ala', 'Ala', 'Ala', 'Ala', ...
              'Tyr', 'Tyr', 'Stop', 'Stop', 'His', 'His', 'Gln', 'Gln', ...
              'Asn', 'Asn', 'Lys', 'Lys', 'Asp', 'Asp', 'Glu', 'Glu', ...
              'Cys', 'Cys', 'Stop', 'Trp', 'Arg', 'Arg', 'Arg', 'Arg', ...
              'Ser', 'Ser', 'Arg', 'Arg', 'Gly', 'Gly', 'Gly', 'Gly'};

If the createAminoTable script is called inside the translatecodon function, then this function will have access to the aminoNames cell array.

A codon can be translated to a number between 1 and 64 that can be used as an index into the aminoNames cell array, that identifies the cell storing the corresponding amino acid. For example, the codon 'GUC' will be translated to the index 14, and aminoNames{14} contains the amino acid 'Val' that corresponds to this codon. The translation of the codon to an index uses the concept of a base 4 number system. A base 4 number system uses four digits, 0, 1, 2 and 3. A base 4 number with three digit places can represent the decimal numbers from 0 to 63. For example, the base 4 number 321 represents the decimal number 57 as derived below (the subscripts indicate the base for the number):

3214 = 3*(4^2) + 2*(4^1) + 1*(4^0) = 3*16 + 2*4 + 1*1 = 5710.

To calculate an index, the four nucleotides U, C, A and G are first mapped to the four digits 0-3 of the base 4 number system as follows:

U --> 0
C --> 1
A --> 2
G --> 3

A sequence of three nucleotides can be interpreted as a three-digit number in base 4, but its conversion to a decimal number that serves as an index into the aminoNames cell array is slightly different from the calculation shown above. Consider the following table in which the nucleotides U, C, A and G have been replaced with the integers 0, 1, 2 and 3, respectively, and the 64 amino acid names have been replaced with a number between 0 and 63. We can add 1 to this number to get the index of the corresponding amino acid in the aminoNames cell array.

first position second position third position
  0 1 2 3  
0 0 16 32 48 0
0 1 17 33 49 1
0 2 18 34 50 2
0 3 19 35 51 3
1 4 20 36 52 0
1 5 21 37 53 1
1 6 22 38 54 2
1 7 23 39 55 3
2 8 24 40 56 0
2 9 25 41 57 1
2 10 26 42 58 2
2 11 27 43 59 3
3 12 28 44 60 0
3 13 29 45 61 1
3 14 30 46 62 2
3 15 31 47 63 3

In this new table, the nucleotide in the second (middle) position (now an integer between 0 and 3) acts as the most significant digit in the base 4 number, determining how many multiples of 16 are contained in the equivalent decimal number stored in the table (e.g. the numbers in the leftmost column contain 0 multiples of 16, the numbers in the second column contain one multiple of 16, and so on). The integer associated with the nucleotide in the first position determines the additional multiples of four (i.e. it acts as the second most significant digit in the base 4 number), and the integer associated with the nucleotide in the rightmost position of the codon acts as the least significant digit, indicating additional multiples of 1. Consider again the example of the codon 'AUG'. This is first transformed to the three digits 2 - 0 - 3, from which we derive the decimal number 11 as follows:

0*16 + 2*4 + 3*1 = 11

Adding 1 gives an index of 12, which is the location of 'Met' in the aminoNames cell array. Using this strategy to calculate an index from the nucleotides of the codon yields much more compact code than the brute force approach described earlier.

Problem 2: Parlez-vous Francais?

The frequency of occurrence of different letters of the alphabet varies across languages and can be used to identify the language in which a particular selection of text is written. The following tables show some data on the frequency of occurrence (listed as percentages) of the nine most common letters in six languages:

  English  German  Finnish   French  Italian  Spanish
e12.31
t9.59
a8.05
o7.94
n7.19
i7.18
s6.59
r6.03
h5.14
e18.46
n11.42
i8.02
r7.14
s7.04
a5.38
t5.22
u5.01
d4.94
a12.06
i10.59
t9.76
n8.64
e8.11
s7.83
l5.86
o5.54
k5.20
e15.87
a9.42
i8.41
s7.90
t7.26
n7.15
r6.46
u6.24
l5.34
e11.79
a11.74
i11.28
o9.83
n6.88
l6.51
r6.37
t5.62
s4.98
e13.15
a12.69
o9.49
s7.60
n6.95
r6.25
i6.25
l5.94
d5.58

In this problem, you will complete a program that uses the above data to identify the language in which a sentence is written. The assign7_programs/Languages folder contains a file frequencyInfo.txt with the frequency information in the above table for the six languages. It also contains a script file testSentences.m with testing code for your program. This script file first constructs six strings of letters that are not real sentences, but just contain the right proportion of the most common letters for each language. These strings are just for testing whether your code is working ok. The file then creates a real test sentence from each language. The findLanguage function is then called with each of the test sentences. To complete this program, you should write three functions named getLanguageInfo, findLanguage and chiSquare, described below.

getLanguageInfo.m

The getLanguageInfo function should have no inputs and two outputs. It should read the information stored in the frequencyInfo.txt file and return two cell arrays. The first cell array should contain the names of the six languages, obtained from the first line of the text file (this should be a single cell array of strings, and not a cell array that is embedded in another cell array). The second cell array should contain the most common letters and their associated frequencies of occurrence for the six languages, obtained from the remaining contents of the text file. The built-in textscan function can be used with a format string that contains 12 format specifiers (%s or %f), to load the remaining contents of the text file into a cell array that has 12 elements that alternate between cell arrays of frequent letters and vectors of frequencies for the six languages.

findLanguage.m

The findLanguage function should have a single input that is a string of words in a particular language, and should return the name of the language. This function can call your getLanguageInfo function to get the language names and frequency data for the six languages. For each language, the findLanguage function should first count the number of occurrences in the input string, of the 9 most common letters for the language. From these counts, you can then determine the frequency of occurrence of each of these 9 letters in the input string. The observed frequencies of occurrence can be compared to the expected frequencies for each language, to determine how well the input sentence fits the expected data for the language. This last step can be accomplished by calculating the Χ2 (Chi-Squared) statistic between the observed and expected frequencies, described in the next section. The most likely language for the input sentence is the one with the smallest value for the Χ2 statistic.

chiSquare.m

The chiSquare function should have two inputs corresponding to the observed and expected frequencies for the 9 most common letters in a language, and should return the value of the Χ2 (Chi-Squared) statistic. To calculate this statistic, suppose you are given a vector E of the expected frequencies of occurrence of particular events (in this case, the appearance of certain letters in a text string) and a second vector O that contains the observed frequencies of occurrence. The Χ2 statistic captures the difference between O and E, and is measured as follows:

Χ2 = Σ (Oi - Ei)2/Ei

where the sum is taken over the set of frequencies.

When you have completed the definitions of your getLanguageInfo, findLanguage and chiSquare functions, run the testing code in the testSentences script to identify the languages for the tests defined there. You are also encouraged to perform your own testing, either by adding examples in the testing file, or running examples from the Command Window.