![]() |
CS 112
Assignment 7
|
|
You can turn in your assignment up until 5:00pm on 4/22/11 without penalty, but it
is best to hand in the assignment at the beginning of class.
Your hardcopy submission should include a cover
sheet and printouts of six code
files: translateRNA.m, rna2amino.m, translatecodon.m, getLanguageInfo.m,
whichLanguage.m
and chiSquare.m
(you can combine your printouts into
one file to save paper). Your electronic submission is described
in the section Uploading your saved work
assign7_programs
folder from the cs112d directory
onto your Desktop. Rename the folder to be yours, e.g. sohie_assign7_programs
.
In MATLAB, set the Current Directory to your assign7_programs
folder. This
folder has two subfolders named Bio
and Languages
that
contain initial files for Problems 1 and 2.
drop/assign07
folder
assign7_programs
folder into your
drop/assign07
folder
assign7_programs
folder from the Desktop by dragging
it to the trash can, and then empty the trash (Finder--> Empty Trash).
When you are done with this assignment, you should have at least the following code
files stored in the assign7_programs
folder: translateRNA.m, rna2amino.m, translatecodon.m, getLanguageInfo.m,
whichLanguage.m
and chiSquare.m
. The first three code files should
be stored in the Bio
subfolder and the second three should be stored in
the Languages
subfolder.
A DNA molecule is a sequence of nucleotides. The exact order of the nucleotides determines the code of each gene. DNA is transcribed to RNA (which, like DNA, is a sequence of nucleotides), and then the RNA is translated into a protein. Each set of three contiguous RNA nucleotides codes for a single amino acid. The protein is made of a chain of amino acids hooked together. See the nobelprize.org website for more information.
first position | second position | third position | |||
U | C | A | G | ||
U | Phe | Ser | Tyr | Cys | U |
U | Phe | Ser | Tyr | Cys | C |
U | Leu | Ser | Stop | Stop | A |
U | Leu | Ser | Stop | Trp | G |
C | Leu | Pro | His | Arg | U |
C | Leu | Pro | His | Arg | C |
C | Leu | Pro | Gln | Arg | A |
C | Leu | Pro | Gln | Arg | G |
A | Ile | Thr | Asn | Ser | U |
A | Ile | Thr | Asn | Ser | C |
A | Ile | Thr | Lys | Arg | A |
A | Met | Thr | Lys | Arg | G |
G | Val | Ala | Asp | Gly | U |
G | Val | Ala | Asp | Gly | C |
G | Val | Ala | Glu | Gly | A |
G | Val | Ala | Glu | Gly | G |
To understand how to use the table, suppose the codon AUG is given. The A in the first position narrows the possible amino acids to those in the four rows highlighted in yellow above. The U in the second position further narrows the possibilities to the leftmost column of these four rows (containing the amino acids Ile and Met). The G in the third position identifies the amino acid as Met.
Your task is to complete a program that translates sample nucleotide sequences into
amino acid sequences. Two files, createAminoTable.m
and sequences.txt
,
are provided for you in the assign7_programs/Bio
folder. You do not
need to make any changes to these files.
You should add a high-level script named translateRNA
and two functions
named rna2amino
and translatecodon
, described below.
translateRNA.m
This high-level script should read in the sample nucleotide sequences stored in the
sequences.txt
file, translate each sequence into its corresponding
amino acids using the rna2amino
function, and print the results in a new text
file named translations.txt
.
The sequences.txt
file begins with the following two examples:
GUCACCUAA
GCUCUCUGC
The contents of the translations.txt
file that your program creates might
begin with something like this:
translations of nucleotide sequences: sequence: GUCACCUAA translation: Val Thr Stop sequence: GCUCUCUGC translation: Ala Leu Cys
rna2amino.m
The rna2amino
function should take an input string that is a nucleotide
sequence (e.g. 'GCUCUCUGC') and return a string containing the corresponding amino acid
sequence (e.g. 'Ala Leu Cys'). You can assume that the first codon begins with the first
character of the input string. The translatecodon
function can be used to
translate each individual codon.
translatecodon.m
The translatecodon
function should take an input string that is a single
codon (e.g. 'GCU') and return the amino acid corresponding to this codon (e.g. 'Ala'),
using the above table. One approach to this subtask uses what we call
brute force. The table above gives all the possible combinations of codons
mapped to amino acids, and a very loooooong conditional statement could perform that
mapping from each of the 64 possible codons to its corresponding amino acid.
Another more compact and elegant approach to translating codons combines indexing
with the application of a base 4 number system. We begin by creating one long list of
64 amino acids by concatenating the columns of the above table, in order from left to
right. The createAminoTable.m
code file stores this ordered list of 64 amino
acids in a cell array named aminoNames
, as shown below:
aminoNames = {'Phe', 'Phe', 'Leu', 'Leu', 'Leu', 'Leu', 'Leu', 'Leu', ... 'Ile', 'Ile', 'Ile', 'Met', 'Val', 'Val', 'Val', 'Val', ... 'Ser', 'Ser', 'Ser', 'Ser', 'Pro', 'Pro', 'Pro', 'Pro', ... 'Thr', 'Thr', 'Thr', 'Thr', 'Ala', 'Ala', 'Ala', 'Ala', ... 'Tyr', 'Tyr', 'Stop', 'Stop', 'His', 'His', 'Gln', 'Gln', ... 'Asn', 'Asn', 'Lys', 'Lys', 'Asp', 'Asp', 'Glu', 'Glu', ... 'Cys', 'Cys', 'Stop', 'Trp', 'Arg', 'Arg', 'Arg', 'Arg', ... 'Ser', 'Ser', 'Arg', 'Arg', 'Gly', 'Gly', 'Gly', 'Gly'};
If the createAminoTable
script is called inside the translatecodon
function, then this function will have access to the aminoNames
cell array.
A codon can be translated to a number between 1 and 64 that can be used as an
index into the aminoNames
cell array, that identifies the cell storing
the corresponding amino acid. For example, the codon 'GUC' will be translated to the
index 14, and aminoNames{14}
contains the amino acid 'Val' that corresponds
to this codon. The translation of the codon to an index uses the concept of
a base 4 number system. A base 4 number system uses four digits, 0, 1, 2 and 3. A
base 4 number with three digit places can represent the decimal numbers from 0 to 63.
For example, the base 4 number 321 represents the decimal number 57 as derived below
(the subscripts indicate the base for the number):
3214 = 3*(4^2) + 2*(4^1) + 1*(4^0) = 3*16 + 2*4 + 1*1 = 5710.
To calculate an index, the four nucleotides U, C, A and G are first mapped to the four digits 0-3 of the base 4 number system as follows:
U --> 0
C --> 1
A --> 2
G --> 3
A sequence of three nucleotides can be interpreted as a three-digit number in base
4, but its conversion to a decimal number that serves as an index into the
aminoNames
cell array is slightly different from the calculation shown above.
Consider the following table in which the nucleotides U, C, A and G have been replaced
with the integers 0, 1, 2 and 3, respectively, and the 64 amino acid names have been
replaced with a number between 0 and 63. We can add 1 to this number to get the index
of the corresponding amino acid in the aminoNames
cell array.
first position | second position | third position | |||
0 | 1 | 2 | 3 | ||
0 | 0 | 16 | 32 | 48 | 0 |
0 | 1 | 17 | 33 | 49 | 1 |
0 | 2 | 18 | 34 | 50 | 2 |
0 | 3 | 19 | 35 | 51 | 3 |
1 | 4 | 20 | 36 | 52 | 0 |
1 | 5 | 21 | 37 | 53 | 1 |
1 | 6 | 22 | 38 | 54 | 2 |
1 | 7 | 23 | 39 | 55 | 3 |
2 | 8 | 24 | 40 | 56 | 0 |
2 | 9 | 25 | 41 | 57 | 1 |
2 | 10 | 26 | 42 | 58 | 2 |
2 | 11 | 27 | 43 | 59 | 3 |
3 | 12 | 28 | 44 | 60 | 0 |
3 | 13 | 29 | 45 | 61 | 1 |
3 | 14 | 30 | 46 | 62 | 2 |
3 | 15 | 31 | 47 | 63 | 3 |
In this new table, the nucleotide in the second (middle) position (now an integer between 0 and 3) acts as the most significant digit in the base 4 number, determining how many multiples of 16 are contained in the equivalent decimal number stored in the table (e.g. the numbers in the leftmost column contain 0 multiples of 16, the numbers in the second column contain one multiple of 16, and so on). The integer associated with the nucleotide in the first position determines the additional multiples of four (i.e. it acts as the second most significant digit in the base 4 number), and the integer associated with the nucleotide in the rightmost position of the codon acts as the least significant digit, indicating additional multiples of 1. Consider again the example of the codon 'AUG'. This is first transformed to the three digits 2 - 0 - 3, from which we derive the decimal number 11 as follows:
0*16 + 2*4 + 3*1 = 11
Adding 1 gives an index of 12, which is the location of 'Met' in the aminoNames
cell array. Using this strategy to calculate an index from the nucleotides of the codon yields
much more compact code than the brute force approach described earlier.
The frequency of occurrence of different letters of the alphabet varies across languages and can be used to identify the language in which a particular selection of text is written. The following tables show some data on the frequency of occurrence (listed as percentages) of the nine most common letters in six languages:
English | German | Finnish | French | Italian | Spanish | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
|
|
|
|
In this problem, you will complete a program that uses the above data to identify
the language in which a sentence is written. The assign7_programs/Languages
folder contains a file frequencyInfo.txt
with the frequency information in
the above table for the six languages. It also contains a script file testSentences.m
with testing code for your program. This script file first constructs six strings of letters that
are not real sentences, but just contain the right proportion of the most common letters for each
language. These strings are just for testing whether your code is working ok. The file then
creates a real test sentence from each language. The findLanguage
function
is then called with each of the test sentences. To complete this program, you should write three
functions named getLanguageInfo, findLanguage
and chiSquare
,
described below.
getLanguageInfo.m
The getLanguageInfo
function should have no inputs and two outputs. It should
read the information stored in the frequencyInfo.txt
file and return two cell
arrays. The first cell array should contain the names of the six languages, obtained from the
first line of the text file (this should be a single cell array of strings, and not a cell
array that is embedded in another cell array). The second cell array should contain the most
common letters
and their associated frequencies of occurrence for the six languages, obtained from the
remaining contents of the text file. The built-in textscan
function can be used
with a format string that contains 12 format specifiers (%s
or %f
),
to load the remaining contents of the text file into a cell array that has 12 elements that alternate
between cell arrays of frequent letters and vectors of frequencies for the six languages.
findLanguage.m
The findLanguage
function should have a single input that is a string of
words in a particular language, and should return the name of the language. This function
can call your getLanguageInfo
function to get the language names and frequency data
for the six languages. For each language, the findLanguage
function should first count the number of occurrences in the input string,
of the 9 most common letters for the language. From these counts, you can then determine the
frequency of occurrence of each of these 9 letters in the input string. The observed
frequencies of occurrence can be compared to the expected frequencies for each language, to
determine how well the input sentence fits the expected data for the language. This last step
can be accomplished by calculating the
Χ2
(Chi-Squared) statistic between the observed and expected frequencies, described in the
next section. The most likely language for the input sentence is the one with the smallest value
for the Χ2 statistic.
chiSquare.m
The chiSquare
function should have two inputs corresponding to the observed
and expected frequencies for the 9 most common letters in a language, and should return the
value of the Χ2 (Chi-Squared) statistic. To calculate this statistic,
suppose you are given a vector E of the expected frequencies of occurrence of
particular events (in this case, the appearance of certain letters in a text string) and a
second vector O that contains the observed frequencies of occurrence.
The Χ2 statistic captures the difference between O and
E, and is measured as follows:
Χ2 = Σ (Oi - Ei)2/Ei
where the sum is taken over the set of frequencies.
When you have completed the definitions of your getLanguageInfo, findLanguage
and chiSquare
functions, run the testing code in the testSentences
script to identify the languages for the tests defined there. You are also encouraged
to perform your own testing, either by adding examples in the testing file, or running examples
from the Command Window.