CS 112

Assignment 7

Due: Thursday, April 22, at the start of class

You can turn in your assignment up until 5:00pm on 4/22/10 without penalty, but it is best to hand in the assignment at the beginning of class. Your hardcopy submission should include a cover sheet and printouts of seven code files: supplyDemand.m, loadData.m, translatecodon.m, rna2amino.m, translateRNA.m, whichLanguage.m and chiSquare.m (you can combine your printouts into one file to save paper). Your electronic submission is described in the section "Uploading your saved work" below.

Reading

The following material in the text is useful to review for this assignment: pages 93-100. You should also review notes and examples from Lectures #16-18 and Lab #8.

Getting Started: Download assign7_programs from cs112d

Use Fetch or WinSCP to connect to the CS server using the cs112d account and download a copy of the assign7_programs folder from the cs112d directory onto your Desktop. Rename the folder to be yours, e.g. sohie_assign7_programs. In MATLAB, set the Current Directory to your assign7_programs folder.

Uploading your saved work

Use Fetch or WinSCP again to upload your saved work, but this time you should connect to the CS file server using your personal user account name and password. After logging in to your account:

When you are done with this assignment, you should have at least the following code files that you wrote or modified, stored in your assign7_programs folder: supplyDemand.m, loadData.m, translatecodon.m, rna2amino.m, translateRNA.m, whichLanguage.m and chiSquare.m.

Exercise: ECON 101 - Supply and Demand

We're going to revisit the supply and demand problem from Assignment 5. You can start with your own solutions or the course solutions (contained in the SupplyDemand_revisited subfolder inside the assign7_programs folder), which include:

In this exercise, you'll extend the code to incorporate file input and output. Specifically, your program will be able to 1) read supply and demand data from a text file and 2) write the results to an output text file.

Part 1: Reading supply and demand data from a text file

Modify the loadData function so that the user is given a third option of reading the supply and demand data from a text file. If the user selects this option, your code should prompt the user for the name of the text file (examples are given below). There are two sample text files in the assign7_programs folder: restaurant.txt and iphone.txt.

The restaurant.txt file contains supply and demand data for dining at five restaurants in Wellesley. Each of the five restaurants can seat 30 couples. Below are the contents of the restaurant.txt text file in the assign7_programs folder. The leftmost column lists a set of prices, the middle column lists the number of seats (where 10 seats means seating for 10 couples) available at that price, and the third column lists the demand from the hungry couples of Wellesley.

 15     0     250
 20    30     200
 30    60     140
 40    75      60
 50    90      50
 75   125      40
100   150      15

The iphone.txt file (shown in the box below) contains supply and demand data for iPhones from different regions of the world.

iphone supply and demand data  
fabricated by Sohie, cs112 April 2010
Supply and Demand Quantities in Thousands (x10^3)
Price      Supply Demand(USA)  Demand(Europe)  Demand(Asia)  Demand(Canada)
$1,500.00   800       2.980         5.808         3.864          3.765
$900.00     650       5.767         8.649         9.986          6.755
$500.00     500      35.876        24.855        29.544         28.087
$400.00     225      45.645        30.786        67.211         45.775
$350.00     100      90.656        55.551       106.656         80.099
$250.00      50     120.771       126.191       232.799        129.632
$100.00      20     223.721       246.687       356.053        145.997
$55.00        1     523.875       364.866       467.524        272.075
Notes about these text files:
  1. textread is how MATLAB reads text from files; see the lab's textread pointers. (Note: in future versions of MATLAB, textscan will replace textread as the preferred function for reading text files.)
  2. restaurant.txt contains only numerical data, whereas iphone.txt contains headers as well as non-numerical characters (e.g., dollar signs and commas). Hints on handling non-numerical characters:
  3. Just as the six columns of demand quantities stored in demandData.txt were summed to produce a cumulative quantity value for each price, the demand quantities of iphone.txt should be summed across the four areas as well
  4. Recall that loadData returns the four vectors of supply/demand prices and quantities
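
As a rough illustration of notes 1 and 3, the sketch below reads three numeric columns with textread and sums several demand columns into one cumulative column. It is not the course solution: the temporary file and the variable names (tmpName, demandByRegion, totalDemand) are made up for this example, and the demand numbers are simply the first two rows of iphone.txt typed in by hand.

```matlab
% Write a small three-column numeric file so the example is self-contained.
% tmpName is a hypothetical file; the values are the first rows of restaurant.txt.
tmpName = [tempname '.txt'];
fid = fopen(tmpName, 'w');
fprintf(fid, '%d %d %d\n', [15 0 250; 20 30 200; 30 60 140]');
fclose(fid);

% textread returns one column vector per conversion in the format string.
[price, supply, demand] = textread(tmpName, '%f %f %f');
delete(tmpName);

% Summing across columns (dimension 2) gives one cumulative demand per price.
demandByRegion = [2.980 5.808 3.864 3.765;
                  5.767 8.649 9.986 6.755];
totalDemand = sum(demandByRegion, 2);
```

textread also accepts a 'headerlines' option that skips a fixed number of lines at the top of a file, which may help with files such as iphone.txt that begin with text.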

Part 2: Writing the results to a text file

Modify the supplyDemand script to write the results of analyzing the supply and demand data to a text file. Add code to prompt the user for an output file name, open this file, write results to this file as the user explores the interaction of supply and demand, and close the file at the end. The first line of the output text file should contain a date and time stamp (explore date and clock in MATLAB's help). The remaining lines of the file store the original data (prices, supply and cumulative demand quantities), as well as the equilibrium price and quantity. If the user explores multiple shifts in supply or demand, those are all written to the same output file (your particular equilibrium values may differ slightly from those shown).

MATLAB command window snapshot*

*red text added to highlight file names from user
EDU>> supplyDemand
:: Welcome to the CS112 Supply and Demand Version 2.0 program! 
:: Select a data source, view supply and demand curves, 
:: see the equilibrium price and quantity, and explore 
:: how these values change with supply and demand 
select the data to analyze: mathworks (1), widget (2), file (3): 3
Please enter the filename ==> restaurant.txt
Reading in data from restaurant.txt
  Please type your output file name => restaurantOut.txt
keep current display? yes (1) no (0): 1
Equilibrium price: $44.2507
Equilibrium quantity: 79.8298
specify the change in supply or demand as a fraction of the
maximum quantity present in the current supply or demand curves
change in supply (-0.5 to 0.5): 0.2
change in demand (-0.5 to 0.5): 0.0
keep current display? yes (1) no (0): 1
Equilibrium price: $36.4401
Equilibrium quantity: 95.4505
keep going? yes (1), no(0): 1
change in supply (-0.5 to 0.5): 0.4
change in demand (-0.5 to 0.5): 0.2
keep current display? yes (1) no (0): 1
Equilibrium price: $41.5196
Equilibrium quantity: 134.955
keep going? yes (1), no(0): 0
EDU>> 

restaurantOut.txt (created by interaction above)

31-Mar-2010 23:04
Supply Price Demand Price     Demand     Supply
          15           15     250.00       0.00 
          20           20     200.00      30.00 
          30           30     140.00      60.00 
          40           40      60.00      60.00 
          50           50      50.00      90.00 
          75           75      40.00     125.00 
         100          100      15.00     150.00 

Estimated equilibrium price: $44.25
Estimated equilibrium quantity:    80
** Change in supply: 0.20   
** Change in demand:  0.00
Estimated equilibrium price: $36.44
Estimated equilibrium quantity: 95
** Change in supply: 0.40   
** Change in demand:  0.20
Estimated equilibrium price: $41.52
Estimated equilibrium quantity: 135
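
A file laid out like the one above could be written with fopen, fprintf and fclose. The sketch below is only one possible arrangement, not the required one: outName stands in for the name the user would type, the three data rows are copied from restaurant.txt, and the column widths are guesses at the sample layout.

```matlab
% outName is a hypothetical output file name; in the assignment it comes from the user.
outName = [tempname '.txt'];
fid = fopen(outName, 'w');

% First line: date and time stamp built from date and clock.
c = clock;                      % [year month day hour minute seconds]
fprintf(fid, '%s %02d:%02d\n', date, c(4), c(5));

% Header row, then one row per price (sample values only).
prices = [15 20 30];
demand = [250 200 140];
supply = [0 30 60];
fprintf(fid, '%12s %12s %10s %10s\n', ...
        'Supply Price', 'Demand Price', 'Demand', 'Supply');
for k = 1:length(prices)
    fprintf(fid, '%12d %12d %10.2f %10.2f\n', ...
            prices(k), prices(k), demand(k), supply(k));
end
fclose(fid);

% Read the file back to check what was written, then clean up.
contents = fileread(outName);
disp(contents);
delete(outName);
```

Keeping the file identifier open while the user explores shifts, and calling fclose only once at the end, lets every equilibrium result land in the same file.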

More sample output

Problem 1: The mystery of life

A DNA molecule is a sequence of nucleotides. The exact order of the nucleotides determines the code of each gene. DNA is transcribed to RNA (which, like DNA, is a sequence of nucleotides), and then the RNA is translated into a protein. Each set of three contiguous RNA nucleotides codes for a single amino acid. The protein is made of a chain of amino acids hooked together. Here is a link to a site with more background information.

How does RNA specify the amino acid sequence?
There are 4 nucleotides and 20 amino acids. Each amino acid is specified by a particular triplet of nucleotides, called a codon. The four nucleotides are represented by A, C, G and U (standing for adenine, cytosine, guanine and uracil, respectively). The 20 amino acids are abbreviated as Phe, Ser, Gly, etc. There are three codons (UAA, UAG and UGA) that act as signals to terminate translation, and these are called STOP codons.

Given an RNA nucleotide sequence, we can calculate the amino acid sequence of the resulting protein, reading off one codon at a time from the RNA. For example, 'GUCACCUAA' would translate into ValThrStop. The table that translates from a triplet of nucleotides (a codon) to one amino acid is given below:

first            second position         third
position     U      C      A      G      position
U            Phe    Ser    Tyr    Cys    U
U            Phe    Ser    Tyr    Cys    C
U            Leu    Ser    Stop   Stop   A
U            Leu    Ser    Stop   Trp    G
C            Leu    Pro    His    Arg    U
C            Leu    Pro    His    Arg    C
C            Leu    Pro    Gln    Arg    A
C            Leu    Pro    Gln    Arg    G
A            Ile    Thr    Asn    Ser    U
A            Ile    Thr    Asn    Ser    C
A            Ile    Thr    Lys    Arg    A
A            Met    Thr    Lys    Arg    G
G            Val    Ala    Asp    Gly    U
G            Val    Ala    Asp    Gly    C
G            Val    Ala    Glu    Gly    A
G            Val    Ala    Glu    Gly    G

Translating codons

... One approach: A smaller example

One subproblem of this task is to translate a codon (e.g. 'GUC') into an amino acid (e.g. Val) by using the table above.

This seating chart example is intended to serve as a guideline if you choose to translate from RNA nucleotides into amino acids by using indexing, rather than brute force conditional statements.

... Another approach: Number System base 4

The problem of indexing into the amino acid table can be seen as an application of a numerical system of base 4: There are four symbols used in this system (U, C, A, and G). These symbols are combined in different ways to create sequences of length 3, known as "codons". We are interested in the decimal value of such a sequence, so we can use it to index into the table.

As a parallel, let's take a look at the number system we use in our everyday lives, which is base 10: We have 10 symbols (0, 1, 2, ..., 9), which are combined to create sequences of any length, also known as numbers, each of them having a value. In particular, here is an example of how the value of a base-10 number can be calculated:

357 --> 3*100 + 5*10 + 7*1, which can be seen as:
357 --> 3*10^2 + 5*10^1 + 7*10^0

Going back to the amino acid table, and the base-4 system, you can apply similar logic to get the decimal value of a codon. First we need to map the codon (a sequence of 3 letters, U, C, A, G) into a numeric sequence:
U --> 0
C --> 1
A --> 2, and
G --> 3.

Now we can find the decimal value of such a sequence, as we did with the decimal number above. Here is an example:
GAC --> 321 --> (most significant digit * 4^2) + (second most significant digit * 4^1) + (least significant digit * 4^0) --> 2*16 + 3*4 + 1*1 --> 45

Notice that according to the way the amino acids are placed in the given table, the most significant digit is not the left-most one - as we are used to in the decimal system - but the middle one. Also, because MATLAB starts counting at 1, as opposed to 0, we need to add 1 to the above result, before we use it to index into the amino acid table.
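
The GAC example can be checked with a few lines of MATLAB. This is a sketch of the index arithmetic only. It assumes the amino acid names are stored in the same order as a 16-by-4 version of the table above read column by column; the actual layout of aminoNames may differ, and the variable names here are made up.

```matlab
codon = 'GAC';
letters = 'UCAG';               % position in this string minus 1: U->0, C->1, A->2, G->3

% Convert each nucleotide to its base-4 digit.
digits = zeros(1, 3);
for k = 1:3
    digits(k) = find(letters == codon(k)) - 1;
end

% The middle letter is the most significant digit, then the first, then the third.
% Add 1 because MATLAB indices start at 1.
index = digits(2)*4^2 + digits(1)*4^1 + digits(3)*4^0 + 1;

% Under the assumed 16-by-4 layout, recover the table row and column.
[row, col] = ind2sub([16 4], index);
fprintf('%s --> index %d (row %d, column %d)\n', codon, index, row, col);
```

Row 14 of the table is the row with G in the first position and C in the third, and column 3 is the A column, so the entry is Asp.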

Writing your MATLAB program

The assign7_programs folder contains a subfolder named Bio with two files for this problem. The script named createAminoTable.m creates a variable named aminoNames, which is a cell array with all the amino acid names that appear in the table provided earlier. The file sequences.txt contains some nucleotide sequences to use for testing. As you write your code for this problem, consider breaking it into smaller parts, each of which can be implemented separately. For example, you could begin by writing (and testing) code for each of the following smaller parts:

Write a script file called translateRNA.m that reads in the sample nucleotide sequences contained in sequences.txt using textread, and then steps through the sample sequences and prints the translation of each sequence into amino acids. Below is some sample MATLAB output from translateRNA:

>> translateRNA
sequence 1: Val Thr Stop
sequence 2: Ala Leu Cys 
sequence 3: Ile Met Ala Trp Thr StopLys 
sequence 4: Tyr Leu Ser Ile Tyr Leu Ser Ile 
sequence 5: Leu Tyr StopSer Leu StopGln 
sequence 6: Gln Thr Val Glu Arg Ala Leu 
sequence 7: Arg Cys Arg Ala Thr Leu Arg Val Ser 
>> 
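
One possible way to step through a sequence one codon at a time is to reshape the string into rows of three characters. The sketch below shows only that splitting step, using the example sequence from earlier in this problem; it assumes the sequence length is a multiple of 3, and the variable names are made up.

```matlab
seq = 'GUCACCUAA';

% reshape fills column by column, so each column of the 3-by-N result is one codon;
% transposing puts one codon on each row.
codons = reshape(seq, 3, [])';

for k = 1:size(codons, 1)
    fprintf('codon %d: %s\n', k, codons(k, :));
end
```

A loop that advances an index by 3 at each step would work just as well; reshape simply makes the grouping explicit.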

Problem 2: Parlez-vous Francais?

The frequency of occurrence of different letters of the alphabet varies across languages and can be used to identify the language in which a particular selection of text is written. The following tables show some data on the frequency of occurrence (listed as percentages) of the nine most common letters in six languages:

  English     German      Finnish     French      Italian     Spanish
  e  12.31    e  18.46    a  12.06    e  15.87    e  11.79    e  13.15
  t   9.59    n  11.42    i  10.59    a   9.42    a  11.74    a  12.69
  a   8.05    i   8.02    t   9.76    i   8.41    i  11.28    o   9.49
  o   7.94    r   7.14    n   8.64    s   7.90    o   9.83    s   7.60
  n   7.19    s   7.04    e   8.11    t   7.26    n   6.88    n   6.95
  i   7.18    a   5.38    s   7.83    n   7.15    l   6.51    r   6.25
  s   6.59    t   5.22    l   5.86    r   6.46    r   6.37    i   6.25
  r   6.03    u   5.01    o   5.54    u   6.24    t   5.62    l   5.94
  h   5.14    d   4.94    k   5.20    l   5.34    s   4.98    d   5.58
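
To compare a sentence against the percentages above, you first need the observed percentage of each letter in the sentence. The sketch below is one way to count them for the nine English letters. It assumes the percentages are taken over alphabetic characters only (spaces and punctuation ignored), which the data above does not state, and the sentence and variable names are made up.

```matlab
sentence = 'The quick brown fox jumps over the lazy dog';

% Ignore case, then keep only the alphabetic characters.
text = lower(sentence);
onlyLetters = text(isletter(text));

% Nine most common English letters, in the order given in the table.
common = 'etaoinsrh';
observed = zeros(1, length(common));
for k = 1:length(common)
    observed(k) = 100 * sum(onlyLetters == common(k)) / length(onlyLetters);
end
disp(observed);
```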

In this problem, you will complete a program that uses the above data to identify the language in which a sentence is written. There are two code files for this problem in the Languages subfolder in the assign7_programs folder:

To complete this program, you should:

Note: The Χ^2 (Chi-Squared) statistic

Suppose you are given a vector E of the expected frequencies of occurrence of particular events (in this case, the appearance of certain letters in a text string) and a second vector O that contains the observed frequencies of occurrence. The Χ^2 statistic captures the difference between O and E, and is measured as follows:

Χ^2 = Σ_i (O_i - E_i)^2 / E_i

where the sum is taken over the set of frequencies.
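
In MATLAB the sum can be written with element-wise operators. The numbers below are made up purely to show the arithmetic, and the variable names are not prescribed by the assignment.

```matlab
E = [10 20 30];                 % expected frequencies (made-up values)
O = [12 18 33];                 % observed frequencies (made-up values)

% .^ and ./ act element by element; sum adds the terms.
chi2 = sum((O - E).^2 ./ E);    % 4/10 + 4/20 + 9/30
fprintf('chi-squared = %.4f\n', chi2);
```

A smaller value means the observed frequencies are closer to the expected ones.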