How to turn in this Project
You are required to turn in both a hardcopy and a softcopy.
If working as a team, the team should submit a single hardcopy
and a single softcopy (to either team member's account). 
Please make sure to keep a copy of your work, on your computer, in
your private directory (or, to play it safe, both).  
Hardcopy Submission
Your hardcopy packet should consist of:
  -  The cover page;
  
-  Your newly developed ORF_Finder.javafile from Task 1.
-  Your newly developed Matrix.javafile from Task 2.
Staple these together and submit them on the due date.
Softcopy Submission
You should submit your final version of your
ORF_Finder directory and your
Matrix directory to the drop/project2
directory in your account on the CS server. 
Last week, you helped your friend out by creating a software tool
that facilitated completion of her genomics homework. This week,
it seems your friend's entire biology class has come by looking
for your help to automate some of their data analysis. As part
of a class project, they are analyzing data from a newly sequenced
genome. They are trying to identify novel ORFs and motifs in the
new genomic sequence. Don't they know you have your own work to do.
Ah, the curse of competence! Well, you say to yourself,
the cat's out of the bag. Then you think, did I really just say 
that to myself? You make a mental note to cut back on the kitten 
videos on YouTube.
Task 1: Finding ORFs
In this task, you will search for potential genes in the DNA from
yeast chromosome #7. This chromosome contains 527 putative protein-coding
genes. Almost all of these genes correspond to an ORF (open reading frame).
An ORF is a genomic sequence beginning with a start codon and 
containing only one in-frame stop codon, which occurs at the end of the
sequence. A stop codon is in-frame if it is a multiple of three
nucleotides downstream of the start codon. For example, the sequence
ATGCATCGTAGCTAG is an ORF because it begins with
ATG and ends with an in-frame stop codon, TAG, 
with no other in-frame stop codons appearing in the sequence. In 
contrast, the sequence ATGGATCTAG is not an ORF because
the stop codon TAG is not in-frame with the start codon.
Similarly, the sequence ATGACCTAGGGTTAG is not an ORF
(though it contains an ORF within the sequence) because there is an 
in-frame stop codon TAG occurring before the end of the
sequence. It is also important to note that ORFs can occur within other ORFs.
For example, the sequence ATGCATGCTAGCATGCTAGCATGCTAG
contains three ORFs: ATGCATGCTAGCATGCTAGCATGCTAG, 
ATGCTAGCATGCTAG, and ATGCTAGCATGCTAG.
When searching for genes in a genomic sequence, it is useful
to identify ORFs because, although there are many more ORFs than genes
in genomes, most genes correspond to an ORF.
In this task, you will identify ORFs that evince properties common to
known yeast genes. To begin, download the 
genomic sequence for yeast
chromosome #7 in Fasta format.
Note: the SequenceOps class that you developed in 
Project 1 will be useful in completing this task. Don't be
shy about using your previously developed code when
programming.
Create a directory named ORF_Finder containing
a Java file ORF_Finder.java. Your ORF_Finder
class should meet the following specifications.
  - A genomic sequence should be read in from a Fasta formatted file
  (e.g., yeast7.txt) specified by a command line 
  argument.
  - The mainmethod should print out the number of ORFs in the
  read-in genomic sequence. (Note: it is ok to compute the number of ORFs
  using a quadratic time algorithm, O(n^2), but if you really want to impress your
  professor, see if you can come up with a linear time algorithm, O(n).)
  - Most yeast proteins are at least 100 amino acids in length and, thus,
  most yeast genes are at least 300 nucleotides in length.
  The mainmethod should print out the number of ORFs in the 
  read-in genomic sequence that are at least 300 nucleotides in length.
  - Cysteine and tryptophan are amino acids that occur infrequently in most 
  yeast proteins whereas alanine and leucine are amino acids that occur
  more frequently in most yeast proteins.
  The mainmethod should print out the number of ORFs in the
  read-in genomic sequence that would code for proteins containing less
  than 4% cysteines and less than 4% tryptophans and more than 6%
  alanines and more than 6% leucines.
  - A Kozak sequence is a sequence of nucleotides surrounding a start codon
  that facilitates ribosome binding. A Kozak sequence can be expressed as 
  RCCATGG, where the characterRrepresents
  any purine, either adanine or guanine. The Kozak sequence encompasses
  the first codon of a gene sequence, i.e., theATGin the
  Kozak sequence should correspond to a start codon.
  Themainmethod should print out the number of ORFs in the 
  read-in genomic sequence for which there is a corresponding
  Kozak sequence.
  - Some codons are used more than others when coding for the same amino
  acid. For example, both AAAandAAGcode for
  the amino acid lysine, but in yeast genes, lysine is coded byAAAapproximately 58% of the time and byAAGapproximately 42% of
  the time. In contrast, in intergenic regions of the yeast genome,AAAaccounts for approximately 70% andAAGaccounts
  for approximately 30% of occurrences of eitherAAAorAAG. 
  The file codonUsage.txt indicates how often
  various codons are used to code for their corresponding amino acid in yeast
  protein coding sequences and in yeast non-coding sequences. 
  In other words, nucleotide triplets occur at different rates
  in protein-coding sequences than in non-coding sequences. The abundance of 
  different tRNA molecules is one factor that can contribute to an organism's 
  bias for or against particular codons in protein-coding sequences. These biases
  can be used to help identify genes, e.g., we may hypothesize that an ORF 
  with a ratio ofAAAtoAAGof 58:42 is more likely
  to correspond to a protein-coding gene than an ORF with a ratio of 70:30.
  The probabilities
  in codonUsage.txt can be used to help 
  distinguish coding ORFs from spurious ORFs by comparing the product of
  coding probabilities for each codon in the ORF to the product of
  noncoding probabilities for each codon in the ORF. For example, the probability
  that the ORFATGCCAATTTAAis coding vs. noncoding based on codon usage
  can be expressed as:
 
	ProbabilityCoding(ATGCCAATTTAA) = ProbabilityCoding(ATG) * ProbabilityCoding(CCA) *
	                                  ProbabilityCoding(ATT) * ProbabilityCoding(TAA) 
					≈ 1.0 * 0.416 * 0.464 * 0.424
					≈ 0.082
   and
 
	ProbabilityNoncoding(ATGCCAATTTAA) = ProbabilityNoncoding(ATG) * ProbabilityNoncoding(CCA) *
	                                     ProbabilityNoncoding(ATT) * ProbabilityNoncoding(TAA) 
					   ≈ 1.0 * 0.316 * 0.396 * 0.455
					   ≈ 0.057
   In this case,ProbabilityCoding(ATGCCAATTTAA)is larger thanProbabilityNoncoding(ATGCCAATTTAA)so we hypothesize that the ORFATGCCAATTTAAis more likely to correspond to a coding sequence than
  a noncoding sequence. Based on the codon usage probabilities found in
  codonUsage.txt,
  themainmethod should print out the number of ORFs in the 
  read-in genomic sequence for which the probability of the ORF corresponding to
  a coding sequence is greater than the probability of the ORF corresponding to a
  noncoding sequence.
  - For comparison, your ORF_Finder program should generate a random 
  DNA nucleotide sequence with the same length and same expected GC 
  content as the read-in genomic sequence. 
  The mainmethod should print out for the random sequence 
  the number of ORFs, the number of ORFs at least 300 nucleotides in length,
  the number of ORFs that would code for proteins containing less than
  4% cysteines and less than 4% tryptophans and more than 6% alanines and
  more than 6% leucines, the number of ORFs for which there is a 
  corresponding Kozak sequence, and the number of ORFs that are more
  likely to correspond to a coding sequence than a noncoding sequences based 
  on codon usage statistics.
  - Finally, ensure your program is able to execute on the genomic sequence 
  from yeast chromosome #7 (as the professor 
  will be testing your program on this file).
As examples, when executing your program with the file
shortGenomicSequence.txt
as a command line argument, your program might output the following:
| [cs313@CS ORF_Finder] java ORF_Finder shortGenomicSequence.txt
**************************************************************
***** Genomic Sequence Read-In From Fasta Formatted File *****
**************************************************************
Number of ORFs is:      3
Number of ORFs at least 300nt is:       0
Number of ORFs with specified amino acid distribution is:       0
Number of ORFs with Kozak sequence is:  1
Number of ORFs with likely protein coding codons is:    1
*****************************************************************************************
***** Random Sequence With Same Length And GC-Content As Sequence Read-In From File *****
*****************************************************************************************
Number of ORFs is:      0
Number of ORFs at least 300nt is:       0
Number of ORFs with specified amino acid distribution is:       0
Number of ORFs with Kozak sequence is:  0
Number of ORFs with likely protein coding codons is:    0
 | 
When executing your program with the file
yeast7.txt
as a command line argument, your program might output the following:
| [cs313@CS ORF_Finder] java ORF_Finder yeast7.txt
**************************************************************
***** Genomic Sequence Read-In From Fasta Formatted File *****
**************************************************************
Number of ORFs is:      20285
Number of ORFs at least 300nt is:       2587
Number of ORFs with specified amino acid distribution is:       2188
Number of ORFs with Kozak sequence is:  103
Number of ORFs with likely protein coding codons is:    8670
*****************************************************************************************
***** Random Sequence With Same Length And GC-Content As Sequence Read-In From File *****
*****************************************************************************************
Number of ORFs is:      19815
Number of ORFs at least 300nt is:       37
Number of ORFs with specified amino acid distribution is:       1566
Number of ORFs with Kozak sequence is:  69
Number of ORFs with likely protein coding codons is:    7034
 | 
Task 2: Motif Instances
TATA boxes and Kozak sequences are examples of motifs found in 
genomics sequences. Instances of these motifs in a genomic sequence, 
e.g., TATAAA or ACCATGG respectively, can 
serve as signals to a cell during important biological processes 
such as transcription and translation.
When investigating a gene in a genome and how the gene is regulated, 
it may be useful to identify instances of various motifs for the 
gene. However, identifying instances of motifs in a genomic sequence 
is non-trivial. For example, the TATA box for most eukaryotic genes 
is composed of the following six nucleotides: TATAAA. 
The most commonly occurring instance of a motif is called the consensus 
sequence for the motif. But some genes have degenerate TATA boxes 
that differ from the consensus sequence, such as TATATA 
or CATAAA. If we only searched a genomic sequence for the 
consensus sequence of a motif, we would miss other (degenerate) 
instances of the motif.
How then might we search for an instance of a gene's TATA box, 
if the instance might differ from the consensus sequence? One approach 
would be to search for sequences of six nucleotides either that match 
the consensus sequence, TATAAA, or that differ from the consensus 
sequence only by 1 or 2 nucleotides. This approach, however, has 
limitations. All TATA box instances have a thymine nucleotide as the 
3rd of the six nucleotides. The 5th of the six nucleotides is an 
adenine about two-thirds of the time and is a thymine about one-third 
of the time. Ideally, our approach for identifying instances of motifs 
should take into account the fact that some positions in the motif may 
contain more variability (e.g., position 5) than other positions 
(e.g., position 3).
A weight-matrix is a means for describing a motif that takes into 
account the nucleotide variability in each position of the motif. 
For instance, a weight-matrix for the TATA box is given below.
		 1	 2	 3	 4	 5	 6
	A	 4%	90%	 0%	95%	66%	97%
	C	10%	 1%	 0%	 0%	 1%	 0%
	G	 3%	 1%	 0%	 0%	 1%	 3%
	T	83%	 8%    100%	 5%	32%	 0%
The weight-matrix above reflects the fact that for a large number 
of well studied TATA box instances, the first nucleotide is an 
adenine 4% of the time, a cytosine 10% of the time, a guanine 3% of 
the time, and a thymine 83% of the time. Looking at the weight-matrix 
above for the TATA box motif, we would conclude that the sequence 
TATAAA better fits the TATA box motif than does 
the sequence GGGGGG.
More formally, given a weight-matrix for a motif, we can calculate 
the probability that a given sequence (of the same length as the 
weight-matrix) corresponds to the motif. The probability that a 
sequence corresponds to a motif is the product of the frequency of 
each nucleotide in the given sequence as determined by the weight-matrix. 
For instance, for the TATA box weight-matrix above, the probability that 
the sequence TATAAA corresponds to the TATA box motif is:
          
	ProbabilityTATA(TATAAA) = 0.83 * 0.90 * 1.00 * 0.95 * 0.66 * 0.97 ≈ 0.45
The probability that the sequence GGGGGG corresponds to the TATA box motif is:
          
	ProbabilityTATA(GGGGGG) = 0.03 * 0.01 * 0.00 * 0.00 * 0.01 * 0.03 ≈ 0.0
The probability that the sequence CATTTG corresponds to the TATA box motif is:
          
	ProbabilityTATA(CATTTG) = 0.10 * 0.90 * 1.00 * 0.05 * 0.32 * 0.03 ≈ 4.3 x 10-5
Note: the SequenceOps class that you developed in 
Project 1 will be useful in completing this task. Don't be
shy about using your previously developed code when
programming.
Create a directory named Matrix containing
a Java file Matrix.java. Your Matrix
class should meet the following specifications.
  - A genomic sequence should be read in from a Fasta formatted file
  specified by a single command line argument.
  - The program should search the read-in genomic sequence and identify
  the hexamer within the sequence that has
  the highest probability of being an instance of the
  TATA box motif (based on the weight-matrix given above). The
  mainmethod should print out the highest scoring 
  hexamer along with its probability.
  - When assessing how likely it is that a hexamer corresponds to 
  the TATA box motif, it would be useful to know how likely it is that 
  the hexamer occurs by random chance in the genomic sequence. Consider 
  that, in addition to all of the genuine TATA boxes in a genome, there 
  will be many spurious sequences similar to TATA boxes that occur in a 
  genome by random chance. In particular, in a large genomic sequence 
  with a high concentration of adenine and thymine nucleotides, the 
  hexamer TATAAAmay occur by random chance. It stands to 
  reason that, in genomes with low GC content, there will be more 
  spurious sequences that are similar to TATA boxes and, in genomes 
  with high GC content, there will be fewer spurious sequences that 
  are similar to TATA boxes. Furthermore, if we observe a sequence such 
  asTATAAAin a genome with high GC content, we may 
  have more confidence that it corresponds to a TATA box than if we 
  observe the same sequence in a genome with low GC content since 
  the sequence is more likely to occur by chance in the low GC 
  content genome.Here, you must assess how likely it is that a hexamer corresponds 
  to a TATA box by considering how likely it is that the hexamer 
  occurs by chance.  In a genome such as that of yeast, with a GC 
  content of 38%, we can use the following background weight-matrix 
  to determine the probability that a hexamer occurs by chance.
  
		 1	 2	 3	 4	 5	 6
	A	31%	31%	31%	31%	31%	31%
	C	19%	19%	19%	19%	19%	19%
	G	19%	19%	19%	19%	19%	19%
	T	31%	31%	31%	31%	31%	31%
   
 As examples, the probability that the sequence TATAAAoccurs by chance is:
 ProbabilityChance(TATAAA) = 0.31 * 0.31 * 0.31 * 0.31 * 0.31 * 0.31 ≈ 8.9 x 10-4
 The probability that the sequence GGGGGGoccurs by chance is:
 ProbabilityChance(GGGGGG) = 0.19 * 0.19 * 0.19 * 0.19 * 0.19 * 0.19 ≈ 4.7 x 10-5
 The probability that the sequence CATTTGoccurs by chance is:
 ProbabilityChance(CATTTG) = 0.19 * 0.31 * 0.31 * 0.31 * 0.31 * 0.19 ≈ 3.3 x 10-4
 The likelihood of a hexamer, X, corresponding to a TATA 
  box, then, can be described as the probability of the hexamer 
  corresponding to the TATA box weight-matrix divided by the probability 
  of the hexamer corresponding to the background weight-matrix. 
  
 Likelihood = ProbabilityTATA(X) / ProbabilityChance(X)
 If the hexamer better fits the TATA box weight-matrix than the 
  background weight-matrix, then the numerator will be larger than the 
  denominator and the likelihood will be greater than 1.0. If the hexamer 
  is more likely to occur by chance than to correspond to a TATA box, 
  then denominator will be larger than the numerator and the likelihood 
  will be less than 1.0. Finally, the log likelihood of a hexamer, X, 
  corresponding to a TATA box is the natural logarithm of the 
  likelihood of X corresponding to a TATA box.
  
 LogLikelihood = ln(ProbabilityTATA(X) / ProbabilityChance(X))
 If the hexamer better fits the TATA box weight-matrix than the 
  background weight-matrix, then the log likelihood will be larger than 0.0.
  If the hexamer is more likely to occur by chance than to correspond to a TATA box, 
  then the log likelihood will be less than 0.0. The program should search the read-in genomic sequence and print
  out every hexamer in the sequence that is more likely to correspond
  to a TATA box instance than to occur by chance (based on the two
  weight-matrices above). In addition to printing out each such hexamer,
  the log likelihood of each such hexamer should be printed out. 
For example, when executing your program with the file
sequence.txt
as a command line argument, your program might output the following:
| [cs313@CS Matrix] java Matrix sequence.txt
Best hexamer is TATATA with probability 0.22027536
Hexamer GATAAA has log ratio 3.4074597555246435
Hexamer GTTAAA has log ratio 0.9870916268742144
Hexamer GATAAA has log ratio 3.4074597555246435
Hexamer TGTAAA has log ratio 2.2278784043228668
Hexamer TATATA has log ratio 5.514221010107727
Hexamer TTTAGA has log ratio 0.11766520397627707
 |