CS112 Assignment 7

Due: Thursday, April 22, at the start of class

You can turn in your assignment up until 5:00pm on 4/22/10 without penalty, but it is best to hand in the assignment at the beginning of class. Your hardcopy submission should include a cover sheet and printouts of seven code files: supplyDemand.m, loadData.m, translatecodon.m, rna2amino.m, translateRNA.m, whichLanguage.m and chiSquare.m (you can combine your printouts into one file to save paper). Your electronic submission is described in the section "Uploading your saved work" below.

Reading

The following material in the text is useful to review for this assignment: pages 93-100. You should also review notes and examples from Lectures #16-18 and Lab #8.

Getting Started: Download assign7_programs from cs112d

Use Fetch or WinSCP to connect to the CS server using the cs112d account and download a copy of the assign7_programs folder from the cs112d directory onto your Desktop. Rename the folder to be yours, e.g. sohie_assign7_programs. In MATLAB, set the Current Directory to your assign7_programs folder.

Uploading your saved work

Use Fetch or WinSCP again to upload your saved work, but this time you should connect to the CS file server using your personal user account name and password. After logging in to your account:

Browse through your directory and find the drop/assign7 folder
Upload your assign7_programs folder into your drop/assign7 folder
When done uploading, be sure to delete your renamed assign7_programs folder from the Desktop by dragging it to the trash can, and then empty the trash (Finder--> Empty Trash).
Be sure to Exit out of MATLAB when you are done

When you are done with this assignment, you should have at least the following code files that you wrote or modified, stored in your assign7_programs folder: supplyDemand.m, loadData.m, translatecodon.m, rna2amino.m, translateRNA.m, whichLanguage.m and chiSquare.m.

Exercise: ECON 101 - Supply and Demand

We're going to revisit the supply and demand problem from Assignment 5. You can start with your own solutions or the course solutions (contained in the SupplyDemand_revisited subfolder inside the assign7_programs folder), which include:

supplyDemand: script for exploring the interaction of supply and demand
loadData: loads data from one of two sources and returns supply/demand prices and quantities
computeEquilibrium: computes and returns equilibrium price and quantity, given input supply/demand schedules and changes in supply/demand
displayCurves: displays the input supply/demand curves, with a dot indicating the estimated equilibrium point

In this exercise, you'll extend the code to incorporate file input and output. Specifically, your program will be able to 1) read supply and demand data from a text file and 2) write the results to an output text file.

Part 1: Reading supply and demand data from a text file

Modify the loadData function so that the user is given a third option of reading the supply and demand data from a text file. If the user selects this option, your code should prompt the user for the name of the text file (examples are given below). There are two sample text files in the assign7_programs folder: restaurant.txt and iphone.txt.

The restaurant.txt file contains supply and demand data for dining at five restaurants in Wellesley. Each of the five restaurants can seat 30 couples. Below are the contents of the restaurant.txt text file in the assign7_programs folder. The leftmost column lists a set of prices, the middle column lists the number of seats available at that price (where 10 seats means seating for 10 couples), and the rightmost column lists the demand from the hungry couples of Wellesley.

 15     0     250
 20    30     200
 30    60     140
 40    75      60
 50    90      50
 75   125      40
100   150      15

The iphone.txt file (shown in the box below) contains supply and demand data for iPhones from different regions of the world.

iphone supply and demand data  
fabricated by Sohie, cs112 April 2010
Supply and Demand Quantities in Thousands (x10^3)
Price      Supply Demand(USA)  Demand(Europe)  Demand(Asia)  Demand(Canada)
$1,500.00   800       2.980         5.808         3.864          3.765
$900.00     650       5.767         8.649         9.986          6.755
$500.00     500      35.876        24.855        29.544         28.087
$400.00     225      45.645        30.786        67.211         45.775
$350.00     100      90.656        55.551       106.656         80.099
$250.00      50     120.771       126.191       232.799        129.632
$100.00      20     223.721       246.687       356.053        145.997
$55.00        1     523.875       364.866       467.524        272.075

Notes about these text files:

textread is how MATLAB reads text from files; see the lab's textread pointers. (Note: in future versions of MATLAB, textscan will replace textread as the preferred function for reading text files.)
restaurant.txt contains only numerical data, whereas iphone.txt contains header lines as well as non-numerical characters (e.g., dollar signs and commas). Hints on handling non-numerical characters:
- To ignore the leading dollar sign ($), use a format string like this: $%f
- To handle commas, you must read in the number as a string (%s) and then convert using str2double
Just as the six columns of demand quantities stored in demandData.txt were summed to produce a cumulative quantity value for each price, the demand quantities of iphone.txt should be summed across the four areas as well
Recall that loadData returns the four vectors of supply/demand prices and quantities
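As a rough illustration of the hints above, here is a minimal sketch of converting price strings and summing regional demand. It works on values copied from the first two rows of iphone.txt rather than reading the file itself. The variable names (priceStrings, regionalDemand, totalDemand) are made up for this example, and stripping the '$' and ',' characters with strrep before calling str2double is just one possible approach, not necessarily the one used in lab.

```matlab
% Price strings as they might come back from a %s read of iphone.txt
priceStrings = {'$1,500.00', '$900.00'};

% Strip '$' and ',' before converting (an assumption: this keeps the
% conversion independent of how str2double treats those characters)
prices = zeros(size(priceStrings));
for k = 1:length(priceStrings)
    cleaned = strrep(strrep(priceStrings{k}, '$', ''), ',', '');
    prices(k) = str2double(cleaned);
end

% One row per price, one column per region (USA, Europe, Asia, Canada)
regionalDemand = [2.980 5.808 3.864 3.765;
                  5.767 8.649 9.986 6.755];

% Sum across the columns to get one cumulative demand value per price
totalDemand = sum(regionalDemand, 2);
```

When reading the actual file, the 'headerlines' option of textread is one way to skip the four header lines before the data rows.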

Part 2: Writing the results to a text file

Modify the supplyDemand script to write the results of analyzing the supply and demand data to a text file. Add code to prompt the user for an output file name, open this file, write results to this file as the user explores the interaction of supply and demand, and close the file at the end. The first line of the output text file should contain a date and time stamp (explore date and clock in MATLAB's help). The remaining lines of the file store the original data (prices, supply and cumulative demand quantities), as well as the equilibrium price and quantity. If the user explores multiple shifts in supply or demand, those are all written to the same output file (your particular equilibrium values may differ slightly from those shown).

MATLAB command window snapshot*

*red text added to highlight file names from user

EDU>> supplyDemand
:: Welcome to the CS112 Supply and Demand Version 2.0 program! 
:: Select a data source, view supply and demand curves, 
:: see the equilibrium price and quantity, and explore 
:: how these values change with supply and demand 
select the data to analyze: mathworks (1), widget (2), file (3): 3
Please enter the filename ==> restaurants.txt
Reading in data from restaurants.txt
  Please type your output file name => restaurantOut.txt
keep current display? yes (1) no (0): 1
Equilibrium price: $44.2507
Equilibrium quantity: 79.8298
specify the change in supply or demand as a fraction of the
maximum quantity present in the current supply or demand curves
change in supply (-0.5 to 0.5): 0.2
change in demand (-0.5 to 0.5): 0.0
keep current display? yes (1) no (0): 1
Equilibrium price: $36.4401
Equilibrium quantity: 95.4505
keep going? yes (1), no(0): 1
change in supply (-0.5 to 0.5): 0.4
change in demand (-0.5 to 0.5): 0.2
keep current display? yes (1) no (0): 1
Equilibrium price: $41.5196
Equilibrium quantity: 134.955
keep going? yes (1), no(0): 0
EDU>>

restaurantOut.txt (created by interaction above)

31-Mar-2010 23:04
Supply Price Demand Price     Demand     Supply
          15           15     250.00       0.00 
          20           20     200.00      30.00 
          30           30     140.00      60.00 
          40           40      60.00      60.00 
          50           50      50.00      90.00 
          75           75      40.00     125.00 
         100          100      15.00     150.00 

Estimated equilibrium price: $44.25
Estimated equilibrium quantity:    80
** Change in supply: 0.20   
** Change in demand:  0.00
Estimated equilibrium price: $36.44
Estimated equilibrium quantity: 95
** Change in supply: 0.40   
** Change in demand:  0.20
Estimated equilibrium price: $41.52
Estimated equilibrium quantity: 135
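An output file in the style shown above could be produced along the following lines. This is only a sketch: the file name is a temporary placeholder (your script would prompt the user instead), the data are the first two rows of the restaurant example, and datestr(now, ...) is just one of several ways to build the time stamp (the assignment also suggests exploring date and clock).

```matlab
% Hypothetical output file name; the real script would ask the user
outName = [tempname '.txt'];
fid = fopen(outName, 'w');

% First line: date and time stamp, e.g. 31-Mar-2010 23:04
fprintf(fid, '%s\n', datestr(now, 'dd-mmm-yyyy HH:MM'));

% Column headers, then one line per price
fprintf(fid, '%12s %12s %10s %10s\n', ...
        'Supply Price', 'Demand Price', 'Demand', 'Supply');
prices = [15 20];
demand = [250 200];
supply = [0 30];
for k = 1:length(prices)
    fprintf(fid, '%12d %12d %10.2f %10.2f\n', ...
            prices(k), prices(k), demand(k), supply(k));
end

% Results are appended as the user explores (value taken from the sample run)
fprintf(fid, 'Estimated equilibrium price: $%.2f\n', 44.2507);

% Always close the file when done
fclose(fid);
```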


Problem 1: The mystery of life

A DNA molecule is a sequence of nucleotides. The exact order of the nucleotides determines the code of each gene. DNA is transcribed to RNA (which, like DNA, is a sequence of nucleotides), and then the RNA is translated into a protein. Each set of three contiguous RNA nucleotides codes for a single amino acid. The protein is made of a chain of amino acids hooked together. Here is a link to a site with more background information.

How does RNA specify the amino acid sequence?
There are 4 nucleotides and 20 amino acids. Each amino acid is specified by a particular triplet of nucleotides, called a codon. The four nucleotides are represented by A, C, G and U (standing for adenine, cytosine, guanine and uracil, respectively). The 20 amino acids are abbreviated as Phe, Ser, Gly, etc. There are three codons (UAA, UAG and UGA) that act as signals to terminate translation, and these are called STOP codons.

Given an RNA nucleotide sequence, we can calculate the amino acid sequence of the resulting protein, reading off one codon at a time from the RNA. For example, 'GUCACCUAA' would translate into ValThrStop. The table that translates from a triplet of nucleotides (a codon) to one amino acid is given below:

first              second position             third
position      U       C       A       G       position

U             Phe     Ser     Tyr     Cys     U
U             Phe     Ser     Tyr     Cys     C
U             Leu     Ser     Stop    Stop    A
U             Leu     Ser     Stop    Trp     G
C             Leu     Pro     His     Arg     U
C             Leu     Pro     His     Arg     C
C             Leu     Pro     Gln     Arg     A
C             Leu     Pro     Gln     Arg     G
A             Ile     Thr     Asn     Ser     U
A             Ile     Thr     Asn     Ser     C
A             Ile     Thr     Lys     Arg     A
A             Met     Thr     Lys     Arg     G
G             Val     Ala     Asp     Gly     U
G             Val     Ala     Asp     Gly     C
G             Val     Ala     Glu     Gly     A
G             Val     Ala     Glu     Gly     G

Translating codons

... One approach: A smaller example

One subproblem of this task is to translate a codon (e.g. 'GUC') into an amino acid (e.g. Val) by using the table above.

One method is to use what we call brute force. The table above gives all the possible combinations of codons mapped to amino acids, and a loooooong conditional statement could perform that mapping from codon to amino acid.

Another more elegant approach is to create an index into the table above. To illustrate the basic logic, imagine that there are 18 students in CS112, with 9 students in each of two labs that are held at 8am and 9am on Wednesdays. Due to disciplinary issues in class, a seating chart had to be created, shown below:

lab                     seat                         row
        A             B              C

8       'ClaraB'      'Ewelina'      'Tiffany'        1
8       'Sarah'       'Jenny'        'Marken'         2
8       'Simone'      'Michelle'     'Leslie'         3
9       'Victoria'    'Christina'    'Jessica'        1
9       'Rifaiyat'    'Harriet'      'ClaraW'         2
9       'Serena'      'Lily'         'Jon Bon Jovi'   3

For example, '8C2' gives Marken's assigned seat for the rest of the semester (8am lab, seat C in the second row). There is a mapping that exists from lab time, seat and row to CS112 student. One method of simplifying the mapping is to collapse our table into one long line of data, and then use an index to access the data. We have three aspects of our seating chart: lab time, seat and row. Suppose we assign lab time 8am to 0 and lab time 9am to 1; seats A, B and C to 0, 1 and 2, respectively, and leave row intact. Then we can figure out who is in which seat by the following formula:

studentindex = 3*time + 6*seat + row

Let's look at the formula more carefully.
Why are we multiplying time by 3?
And why are we multiplying seat by 6?
If we collapse all the names into one long cell array by concatenating the columns like this:
names = {'ClaraB' 'Sarah' 'Simone' 'Victoria' 'Rifaiyat' 'Serena' 'Ewelina' 'Jenny' 'Michelle' ...}
then we can use studentindex to index our names cell array and retrieve the student's name, given her seat assignment.
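The seating chart lookup can be sketched in MATLAB as follows. The full names cell array is built here by concatenating the three seat columns of the chart, and the way the seat code '8C2' is split into its parts is only one possible choice; the variable names other than studentindex and names are made up for this example.

```matlab
% All 18 names: column A (8am then 9am), then column B, then column C
names = {'ClaraB', 'Sarah', 'Simone', 'Victoria', 'Rifaiyat', 'Serena', ...
         'Ewelina', 'Jenny', 'Michelle', 'Christina', 'Harriet', 'Lily', ...
         'Tiffany', 'Marken', 'Leslie', 'Jessica', 'ClaraW', 'Jon Bon Jovi'};

seatCode = '8C2';                       % 8am lab, seat C, row 2

time = str2double(seatCode(1)) - 8;     % 8am -> 0, 9am -> 1
seat = seatCode(2) - 'A';               % A -> 0, B -> 1, C -> 2
row  = str2double(seatCode(3));         % rows stay 1, 2, 3

studentindex = 3*time + 6*seat + row;   % 3*0 + 6*2 + 2 = 14
student = names{studentindex};          % 'Marken'
```

Because row is left as 1, 2 or 3, studentindex already starts at 1 and can be used directly as a MATLAB index.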

This seating chart example is intended to serve as a guideline if you choose to translate from RNA nucleotides into amino acids by using indexing, rather than brute force conditional statements.

... Another approach: Number System base 4

The problem of indexing into the amino acid table can be seen as an application of a numerical system of base 4: There are four symbols used in this system (U, C, A, and G). These symbols are combined in different ways, to create sequences of length 3, known as "codons". We are interested in the decimal value of such a sequence, so we can use it to index into the table.

As a parallel, let's take a look at the number system we use in our everyday lives, which is base 10: We have 10 symbols (0, 1, 2, ..., 9), which are combined to create sequences, of any length, also known as numbers, each of them having a value. In particular, here is an example of how the value of a base-10 number can be calculated:

357 --> 3*100 + 5*10 + 7*1, which can be seen as:
357 --> 3*10^2 + 5*10^1 + 7*10^0

Going back to the amino acid table, and the base-4 system, you can apply similar logic to get the decimal value of a codon. First we need to map the codon (a sequence of 3 letters, U, C, A, G) into a numeric sequence:
U --> 0
C --> 1
A --> 2, and
G --> 3.

Now we can find the decimal value of such a sequence, as we did with the decimal number above. Here is an example:
GAC --> 321 --> (most significant digit * 4^2) + (second most significant digit * 4^1) + (least significant digit * 4^0) --> 2*16 + 3*4 + 1*1 --> 45

Notice that according to the way the amino acids are placed in the given table, the most significant digit is not the left-most one - as we are used to in the decimal system - but the middle one. Also, because MATLAB starts counting at 1, as opposed to 0, we need to add 1 to the above result, before we use it to index into the amino acid table.
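The arithmetic for the GAC example can be checked with a short sketch like the one below. It only computes the index; it assumes the amino acid names are stored column by column, as in the table above, so that the middle nucleotide is the most significant digit. The variable names are made up for this example.

```matlab
codon = 'GAC';
letters = 'UCAG';                 % U -> 0, C -> 1, A -> 2, G -> 3

% Map each nucleotide to its base-4 digit
d = zeros(1, 3);
for k = 1:3
    d(k) = find(letters == codon(k)) - 1;
end
% d is now [3 2 1]

% Middle digit is most significant, then the first, then the last
value = 16*d(2) + 4*d(1) + d(3);  % 2*16 + 3*4 + 1*1 = 45

% MATLAB indices start at 1, so shift by one before indexing
tableIndex = value + 1;           % 46
```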

Writing your MATLAB program

The assign7_programs folder contains a subfolder named Bio with two files for this problem. The script named createAminoTable.m creates a variable named aminoNames, which is a cell array with all the amino acid names that appear in the table provided earlier. The file sequences.txt contains some nucleotide sequences to use for testing. As you write your code for this problem, consider breaking it into smaller parts, each of which can be implemented separately. For example, you could begin by writing (and testing) code for each of the following smaller parts:

translate a single codon into the corresponding amino acid, e.g. GUA --> Val
translate a whole nucleotide sequence into a sequence of amino acids, e.g. GUCACCUAA --> Val Thr Stop

Write a script file called translateRNA.m that reads in the sample nucleotide sequences contained in sequences.txt using textread, and then steps through the sample sequences and prints the translation of each sequence into amino acids. Below is some sample MATLAB output from translateRNA:

>> translateRNA
sequence 1: Val Thr Stop
sequence 2: Ala Leu Cys 
sequence 3: Ile Met Ala Trp Thr StopLys 
sequence 4: Tyr Leu Ser Ile Tyr Leu Ser Ile 
sequence 5: Leu Tyr StopSer Leu StopGln 
sequence 6: Gln Thr Val Glu Arg Ala Leu 
sequence 7: Arg Cys Arg Ala Thr Leu Arg Val Ser 
>>

Problem 2: Parlez-vous Francais?

The frequency of occurrence of different letters of the alphabet varies across languages and can be used to identify the language in which a particular selection of text is written. The following tables show some data on the frequency of occurrence (listed as percentages) of the nine most common letters in six languages:

English     German      Finnish     French      Italian     Spanish

e  12.31    e  18.46    a  12.06    e  15.87    e  11.79    e  13.15
t   9.59    n  11.42    i  10.59    a   9.42    a  11.74    a  12.69
a   8.05    i   8.02    t   9.76    i   8.41    i  11.28    o   9.49
o   7.94    r   7.14    n   8.64    s   7.90    o   9.83    s   7.60
n   7.19    s   7.04    e   8.11    t   7.26    n   6.88    n   6.95
i   7.18    a   5.38    s   7.83    n   7.15    l   6.51    r   6.25
s   6.59    t   5.22    l   5.86    r   6.46    r   6.37    i   6.25
r   6.03    u   5.01    o   5.54    u   6.24    t   5.62    l   5.94
h   5.14    d   4.94    k   5.20    l   5.34    s   4.98    d   5.58

In this problem, you will complete a program that uses the above data to identify the language in which a sentence is written. There are two code files for this problem in the Languages subfolder in the assign7_programs folder:

The setupTables.m script file constructs a cell array named languages that stores the information in the above tables. The languages cell array consists of 6 nested cell arrays that each contain three elements: the name of the language, a string of the 9 most common letters, and a vector of the expected frequencies of occurrence of these 9 letters.
The testSentences.m script file contains testing code for your program. This file first constructs 6 strings of letters that are not real sentences, but just contain the right proportion of the most common letters for each language. These strings are just for testing whether your code is working ok. The file then creates a real test sentence from each language. The findLanguage function is then called with each of the test sentences.

To complete this program, you should:

First, write the findLanguage function. Think about the appropriate input(s) and output of this function. For each language, the function should count the number of occurrences, in the examined string, of that language's 9 most common letters. From these counts, you can then determine the frequency of occurrence of each of these 9 letters in the examined string. The observed frequencies can then be compared to the expected frequencies for the language, to determine how well the input sentence fits the expected data. This comparison can be made by calculating the Χ² (Chi-Squared) statistic between the observed and expected frequencies, described in the note below. The most likely language for the input sentence is the one with the smallest value of the Χ² statistic.

When you have completed the definition of the findLanguage function, run the testing code in the testSentences script to identify the languages for the tests defined there. You are also encouraged to perform your own testing, either by adding examples in the testing file, or running examples from the Command Window.

Note: The Χ² (Chi-Squared) statistic

Suppose you are given a vector E of the expected frequencies of occurrence of particular events (in this case, the appearance of certain letters in a text string) and a second vector O that contains the observed frequencies of occurrence. The Χ² statistic captures the difference between O and E, and is measured as follows:

Χ² = Σ (O_i - E_i)²/E_i

where the sum is taken over the set of frequencies.
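As a small illustration, the sketch below computes observed letter frequencies for a short string and compares them to the English row of the table using the formula above. Two things here are assumptions rather than part of the assignment: the frequencies are expressed as percentages of all letters in the string (ignoring spaces and punctuation), and the variable names are made up for this example.

```matlab
sentence = 'Hello there';
commonLetters = 'etaonisrh';      % English, in table order
expected = [12.31 9.59 8.05 7.94 7.19 7.18 6.59 6.03 5.14];

s = lower(sentence);
numLetters = sum(isletter(s));    % 10 letters, the space is not counted

% Observed frequency (as a percentage) of each common letter
observed = zeros(size(expected));
for k = 1:length(commonLetters)
    observed(k) = 100 * sum(s == commonLetters(k)) / numLetters;
end
% observed is [30 10 0 10 0 0 0 10 20]

% Chi-squared statistic between observed and expected frequencies
chi2 = sum((observed - expected).^2 ./ expected);
```

Repeating this calculation with each language's letters and expected frequencies, and keeping the language with the smallest chi2, gives the comparison described in the problem.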
