CS 112
Assignment 7
You can turn in your assignment up until 5:00pm on 4/22/10 without penalty, but it
is best to hand in the assignment at the beginning of class.
Your hardcopy submission should include a cover sheet and printouts of seven code files: supplyDemand.m, loadData.m, translatecodon.m, rna2amino.m, translateRNA.m, whichLanguage.m and chiSquare.m (you can combine your printouts into one file to save paper). Your electronic submission is described in the section Uploading your saved work.
assign7_programs folder from the cs112d directory onto your Desktop. Rename the folder to be yours, e.g. sohie_assign7_programs.
In MATLAB, set the Current Directory to your assign7_programs folder.
drop/assign7 folder
assign7_programs folder into your drop/assign7 folder
assign7_programs folder from the Desktop by dragging it to the trash can, and then empty the trash (Finder --> Empty Trash).
When you are done with this assignment, you should have at least the following code files that you wrote or modified, stored in your assign7_programs folder: supplyDemand.m, loadData.m, translatecodon.m, rna2amino.m, translateRNA.m, whichLanguage.m and chiSquare.m.
We're going to revisit the supply and demand problem from Assignment 5. You can start with your own solutions or the course solutions (contained in the SupplyDemand_revisited subfolder inside the assign7_programs folder), which include:
assign7_programs folder: restaurant.txt and iphone.txt.
The restaurant.txt file contains supply and demand data for dining at five restaurants in Wellesley. Each of the five restaurants can seat 30 couples.
Below are the contents of the restaurant.txt text file in the assign7_programs folder. The leftmost column lists a set of prices, the middle column lists the number of seats (where 10 seats means seating for 10 couples) available at that price, and the third column lists the demand from the hungry couples of Wellesley.
15 0 250
20 30 200
30 60 140
40 75 60
50 90 50
75 125 40
100 150 15
The iphone.txt file (shown in the box below) contains supply and demand data for iphones from different regions of the world.
iphone supply and demand data fabricated by Sohie, cs112 April 2010
Supply and Demand Quantities in Thousands (x10^3)
Price Supply Demand(USA) Demand(Europe) Demand(Asia) Demand(Canada)
$1,500.00 800 2.980 5.808 3.864 3.765
$900.00 650 5.767 8.649 9.986 6.755
$500.00 500 35.876 24.855 29.544 28.087
$400.00 225 45.645 30.786 67.211 45.775
$350.00 100 90.656 55.551 106.656 80.099
$250.00 50 120.771 126.191 232.799 129.632
$100.00 20 223.721 246.687 356.053 145.997
$55.00 1 523.875 364.866 467.524 272.075
Notes about these text files:
textread is how MATLAB reads text from files. Click here for lab textread pointers. (Note: in future versions of MATLAB, textscan will replace textread as the preferred function for reading text files.)
$%f
EDU>> supplyDemand
:: Welcome to the CS112 Supply and Demand Version 2.0 program!
:: Select a data source, view supply and demand curves,
:: see the equilibrium price and quantity, and explore
:: how these values change with supply and demand
select the data to analyze:
mathworks (1), widget (2), file (3): 3
Please enter the filename ==> restaurants.txt
Reading in data from restaurants.txt
Please type your output file name => restaurantOut.txt
keep current display? yes (1) no (0): 1
Equilibrium price: $44.2507
Equilibrium quantity: 79.8298
specify the change in supply or demand as a fraction of the
maximum quantity present in the current supply or demand curves
change in supply (-0.5 to 0.5): 0.2
change in demand (-0.5 to 0.5): 0.0
keep current display? yes (1) no (0): 1
Equilibrium price: $36.4401
Equilibrium quantity: 95.4505
keep going? yes (1), no(0): 1
change in supply (-0.5 to 0.5): 0.4
change in demand (-0.5 to 0.5): 0.2
keep current display? yes (1) no (0): 1
Equilibrium price: $41.5196
Equilibrium quantity: 134.955
keep going? yes (1), no(0): 0
EDU>>
31-Mar-2010 23:04
Supply Price Demand Price Demand Supply
15 15 250.00 0.00
20 20 200.00 30.00
30 30 140.00 60.00
40 40 60.00 60.00
50 50 50.00 90.00
75 75 40.00 125.00
100 100 15.00 150.00
Estimated equilibrium price: $44.25
Estimated equilibrium quantity: 80
** Change in supply: 0.20
** Change in demand: 0.00
Estimated equilibrium price: $36.44
Estimated equilibrium quantity: 95
** Change in supply: 0.40
** Change in demand: 0.20
Estimated equilibrium price: $41.52
Estimated equilibrium quantity: 135
How does RNA specify the amino acid sequence?
There are 4 nucleotides and 20 amino acids. Each amino acid
is specified by a particular triplet of nucleotides, called a codon.
The four nucleotides are represented by A, C, G and U (standing for adenine,
cytosine, guanine and uracil, respectively). The 20 amino acids
are abbreviated as Phe, Ser, Gly, etc. There are three codons (UAA, UAG and UGA) that act
as signals to terminate translation, and these are called STOP codons.
Given an RNA nucleotide sequence, we can calculate the amino acid sequence of the resulting protein, reading off one codon at a time from the RNA. For example, 'GUCACCUAA' would translate into ValThrStop. The table that translates from a triplet of nucleotides (a codon) to one amino acid is given below:
first position | second position U | second position C | second position A | second position G | third position |
U | Phe | Ser | Tyr | Cys | U |
U | Phe | Ser | Tyr | Cys | C |
U | Leu | Ser | Stop | Stop | A |
U | Leu | Ser | Stop | Trp | G |
C | Leu | Pro | His | Arg | U |
C | Leu | Pro | His | Arg | C |
C | Leu | Pro | Gln | Arg | A |
C | Leu | Pro | Gln | Arg | G |
A | Ile | Thr | Asn | Ser | U |
A | Ile | Thr | Asn | Ser | C |
A | Ile | Thr | Lys | Arg | A |
A | Met | Thr | Lys | Arg | G |
G | Val | Ala | Asp | Gly | U |
G | Val | Ala | Asp | Gly | C |
G | Val | Ala | Glu | Gly | A |
G | Val | Ala | Glu | Gly | G |
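Reading an RNA string one codon at a time can be sketched as below. This is only an illustration, not the required design: the variable names `rna`, `numCodons` and `codons` are made up for the example, and dropping leftover nucleotides at the end of a sequence is an assumption, since the assignment does not say how to treat them.

```matlab
rna = 'GUCACCUAA';

% Keep only complete codons (assumption: trailing leftovers are ignored).
numCodons = floor(length(rna) / 3);
rna = rna(1:3*numCodons);

% reshape fills column by column, so each column of the 3-by-N matrix holds
% one codon; transposing puts one codon on each row.
codons = reshape(rna, 3, numCodons)';

for k = 1:numCodons
    disp(codons(k, :));   % one codon per line
end
```

Each row of `codons` can then be looked up in the table above to obtain one amino acid.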
lab | seat A | seat B | seat C | row |
8 | 'ClaraB' | 'Ewelina' | 'Tiffany' | 1 |
8 | 'Sarah' | 'Jenny' | 'Marken' | 2 |
8 | 'Simone' | 'Michelle' | 'Leslie' | 3 |
9 | 'Victoria' | 'Christina' | 'Jessica' | 1 |
9 | 'Rifaiyat' | 'Harriet' | 'ClaraW' | 2 |
9 | 'Serena' | 'Lily' | 'Jon Bon Jovi' | 3 |
For example, '8C2' gives Marken's assigned seat for the rest of the semester (8am lab, seat C in the second row). There is a mapping that exists from lab time, seat and row to CS112 student. One method of simplifying the mapping is to collapse our table into one long line of data, and then use an index to access the data. We have three aspects of our seating chart: lab time, seat and row. Suppose we assign lab time 8am to 0 and lab time 9am to 1; seats A, B and C to 0, 1 and 2, respectively, and leave row intact. Then we can figure out who is in which seat by the following formula:
studentindex = 3*time + 6*seat + row
Let's look at the formula more carefully. Why are we multiplying time by 3? And why are we multiplying seat by 6?
If we collapse all the names into one long cell array by concatenating the columns like this:
names = {'ClaraB' 'Sarah' 'Simone' 'Victoria' 'Rifaiyat' 'Serena' 'Ewelina' 'Jenny' 'Michelle' ...}
then we can use studentindex to index our names cell array and retrieve the student's name, given her seat assignment.
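As a minimal sketch of this lookup, the code below fills in the full `names` cell array from the seating chart (column by column, as described) and decodes the seat code '8C2'. The variable `code` and the way it is parsed are illustrative assumptions, not part of the assignment.

```matlab
% Names flattened column by column (seat A, then B, then C). Within each
% seat, the 8am lab's rows 1-3 come before the 9am lab's rows 1-3.
names = {'ClaraB', 'Sarah', 'Simone', 'Victoria', 'Rifaiyat', 'Serena', ...
         'Ewelina', 'Jenny', 'Michelle', 'Christina', 'Harriet', 'Lily', ...
         'Tiffany', 'Marken', 'Leslie', 'Jessica', 'ClaraW', 'Jon Bon Jovi'};

code = '8C2';                       % lab time, seat letter, row

time = str2double(code(1)) - 8;     % '8' -> 0, '9' -> 1
seat = code(2) - 'A';               % 'A' -> 0, 'B' -> 1, 'C' -> 2
row  = str2double(code(3));         % rows stay 1, 2, 3

studentindex = 3*time + 6*seat + row;
student = names{studentindex};      % 'Marken'
```

Each lab time contributes 3 rows, so moving from the 8am lab to the 9am lab skips 3 names. Each seat column holds 2 labs of 3 rows, so moving one seat to the right skips 6 names.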
This seating chart example is intended to serve as a guideline if you choose to translate from RNA nucleotides into amino acids by using indexing, rather than brute force conditional statements.
As a parallel, let's take a look at the number system we use in our everyday lives, which is base 10. We have 10 symbols (0, 1, 2, ..., 9), which are combined to create sequences of any length, also known as numbers, each of which has a value. Here is an example of how the value of a base-10 number can be calculated:
357 --> 3*100 + 5*10 + 7*1, which can be seen as:
357 --> 3*10^2 + 5*10^1 + 7*10^0
Going back to the amino acid table, and the base-4 system, you can apply similar logic
to get the decimal value of a codon. First we need to map the codon (a sequence of 3 letters,
U, C, A, G) into a numeric sequence:
U --> 0
C --> 1
A --> 2, and
G --> 3.
Now we can find the decimal value of such a sequence, as we did with the decimal number above.
Here is an example:
GAC --> 321 --> (most significant digit * 4^2) + (second most significant digit * 4^1) +
(least significant digit * 4^0) --> 2*16 + 3*4 + 1*1 --> 45
Notice that according to the way the amino acids are placed in the given table, the most significant digit is not the left-most one - as we are used to in the decimal system - but the middle one. Also, because MATLAB starts counting at 1, as opposed to 0, we need to add 1 to the above result, before we use it to index into the amino acid table.
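The GAC example can be sketched in MATLAB as follows. The names `alphabet`, `vals`, `value` and `index` are illustrative. Whether `index` selects the right entry depends on `aminoNames` being laid out in the order described above, which should be checked against createAminoTable.m.

```matlab
codon = 'GAC';

% Map each nucleotide to its base-4 digit: U -> 0, C -> 1, A -> 2, G -> 3.
alphabet = 'UCAG';
vals = zeros(1, 3);
for k = 1:3
    vals(k) = find(alphabet == codon(k)) - 1;
end

% The second position is the most significant digit, followed by the first
% position and then the third position.
value = vals(2)*4^2 + vals(1)*4^1 + vals(3)*4^0;   % 2*16 + 3*4 + 1*1 = 45

index = value + 1;   % MATLAB indexing starts at 1, so GAC maps to 46
```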
The assign7_programs folder contains a subfolder named Bio with two files for this problem. The script named createAminoTable.m creates a variable named aminoNames, which is a cell array with all the amino acid names that appear in the table provided earlier. The file sequences.txt contains some nucleotide sequences to use for testing.
As you write your code for this problem, consider breaking it into smaller parts,
each of which can be implemented separately. For example, you could begin by writing (and testing)
code for each of the following smaller parts:
Write a script file called translateRNA.m that reads in the sample nucleotide sequences contained in sequences.txt using textread, and then steps through the sample sequences and prints the translation of each sequence into amino acids.
Below is some sample MATLAB output from translateRNA:
>> translateRNA
sequence 1: Val Thr Stop
sequence 2: Ala Leu Cys
sequence 3: Ile Met Ala Trp Thr StopLys
sequence 4: Tyr Leu Ser Ile Tyr Leu Ser Ile
sequence 5: Leu Tyr StopSer Leu StopGln
sequence 6: Gln Thr Val Glu Arg Ala Leu
sequence 7: Arg Cys Arg Ala Thr Leu Arg Val Ser
>>
The frequency of occurrence of different letters of the alphabet varies across languages and can be used to identify the language in which a particular selection of text is written. The following tables show some data on the frequency of occurrence (listed as percentages) of the nine most common letters in six languages:
English | German | Finnish | French | Italian | Spanish |
In this problem, you will complete a program that uses the above data to identify the language in which a sentence is written. There are two code files for this problem in the Languages subfolder in the assign7_programs folder:
The setupTables.m script file constructs a cell array named languages that stores the information in the above tables. The languages cell array consists of 6 nested cell arrays that each contain three elements: the name of the language, a string of the 9 most common letters, and a vector of the expected frequencies of occurrence of these 9 letters.
The testSentences.m script file contains testing code for your program. This file first constructs 6 strings of letters that are not real sentences, but just contain the right proportion of the most common letters for each language. These strings are just for testing whether your code is working correctly. The file then creates a real test sentence from each language. The findLanguage function is then called with each of the test sentences.
To complete this program, you should:
findLanguage function. Think about the appropriate input(s) and output of this function. For each language, this function should first count the number of occurrences, in the examined string, of the 9 most common letters for that language. From these counts, you can then determine the frequency of occurrence of each of these 9 letters in the examined string. The observed frequencies of occurrence can be compared to the expected frequencies for each language, to determine how well the input sentence fits the expected data for the language. This last step can be accomplished by calculating the Χ² (Chi-Squared) statistic between the observed and expected frequencies, described in the note below. The most likely language for the input sentence is the one with the smallest value for the Χ² statistic.
findLanguage function, run the testing code in the testSentences script to identify the languages for the tests defined there. You are also encouraged to perform your own testing, either by adding examples in the testing file, or running examples from the Command Window.
Suppose you are given a vector E of the expected frequencies of occurrence of particular events (in this case, the appearance of certain letters in a text string) and a second vector O that contains the observed frequencies of occurrence. The Χ² statistic captures the difference between O and E, and is measured as follows:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where the sum is taken over the set of frequencies.
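A minimal sketch of this calculation is shown below. The test string, the three letters and the expected percentages are made-up values for illustration only (the real data lives in the languages cell array). Dividing each count by the total number of letters in the string is also an assumption about how the observed frequencies should be normalized.

```matlab
% Hypothetical inputs, for illustration only.
sentence = 'the rain in spain stays mainly in the plain';
letters  = 'eta';                 % stand-in for a language's common letters
E        = [12.0 9.0 8.0];        % made-up expected percentages

% Observed frequencies, as a percentage of all letters in the string
% (assumption: the denominator is the total letter count).
str = lower(sentence);
numLetters = sum(isletter(str));
O = zeros(1, length(letters));
for k = 1:length(letters)
    O(k) = 100 * sum(str == letters(k)) / numLetters;
end

% Chi-squared statistic: sum over the letters of (O - E)^2 / E.
chi2 = sum((O - E).^2 ./ E);
```

Repeating this for each language and keeping the language with the smallest `chi2` gives the most likely match.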