CS313 Project 2

How to turn in this Project

You are required to turn in both a hardcopy and a softcopy. If working as a team, the team should submit a single hardcopy and a single softcopy (to either team member's account). Please make sure to keep a copy of your work, on your computer, in your private directory (or, to play it safe, both).

Hardcopy Submission

Your hardcopy packet should consist of:

The cover page;
Your newly developed ORF_Finder.java file from Task 1.
Your newly developed Matrix.java file from Task 2.

Staple these together and submit them on the due date.

Softcopy Submission

You should submit your final version of your ORF_Finder directory and your Matrix directory to the drop/project2 directory in your account on the CS server.

Last week, you helped your friend out by creating a software tool that facilitated completion of her genomics homework. This week, it seems your friend's entire biology class has come by looking for your help to automate some of their data analysis. As part of a class project, they are analyzing data from a newly sequenced genome. They are trying to identify novel ORFs and motifs in the new genomic sequence. Don't they know you have your own work to do. Ah, the curse of competence! Well, you say to yourself, the cat's out of the bag. Then you think, did I really just say that to myself? You make a mental note to cut back on the kitten videos on YouTube.

Task 1: Finding ORFs

In this task, you will search for potential genes in the DNA from yeast chromosome #7. This chromosome contains 527 putative protein-coding genes. Almost all of these genes correspond to an ORF (open reading frame). An ORF is a genomic sequence beginning with a start codon and containing only one in-frame stop codon, which occurs at the end of the sequence. A stop codon is in-frame if it is a multiple of three nucleotides downstream of the start codon. For example, the sequence ATGCATCGTAGCTAG is an ORF because it begins with ATG and ends with an in-frame stop codon, TAG, with no other in-frame stop codons appearing in the sequence. In contrast, the sequence ATGGATCTAG is not an ORF because the stop codon TAG is not in-frame with the start codon. Similarly, the sequence ATGACCTAGGGTTAG is not an ORF (though it contains an ORF within the sequence) because there is an in-frame stop codon TAG occurring before the end of the sequence. It is also important to note that ORFs can occur within other ORFs. For example, the sequence ATGCATGCTAGCATGCTAGCATGCTAG contains three ORFs: ATGCATGCTAGCATGCTAGCATGCTAG, ATGCTAGCATGCTAG, and ATGCTAGCATGCTAG. When searching for genes in a genomic sequence, it is useful to identify ORFs because, although there are many more ORFs than genes in genomes, most genes correspond to an ORF.

In this task, you will identify ORFs that evince properties common to known yeast genes. To begin, download the genomic sequence for yeast chromosome #7 in Fasta format.

Note: the SequenceOps class that you developed in Project 1 will be useful in completing this task. Don't be shy about using your previously developed code when programming.

Create a directory named ORF_Finder containing a Java file ORF_Finder.java. Your ORF_Finder class should meet the following specifications.

A genomic sequence should be read in from a Fasta formatted file (e.g., yeast7.txt) specified by a command line argument.

The main method should print out the number of ORFs in the read-in genomic sequence. (Note: it is ok to compute the number of ORFs using a quadratic time algorithm, O(n^2), but if you really want to impress your professor, see if you can come up with a linear time algorithm, O(n).)

Most yeast proteins are at least 100 amino acids in length and, thus, most yeast genes are at least 300 nucleotides in length. The main method should print out the number of ORFs in the read-in genomic sequence that are at least 300 nucleotides in length.

Cysteine and tryptophan are amino acids that occur infrequently in most yeast proteins whereas alanine and leucine are amino acids that occur more frequently in most yeast proteins. The main method should print out the number of ORFs in the read-in genomic sequence that would code for proteins containing less than 4% cysteines and less than 4% tryptophans and more than 6% alanines and more than 6% leucines.

A Kozak sequence is a sequence of nucleotides surrounding a start codon that facilitates ribosome binding. A Kozak sequence can be expressed as RCCATGG, where the character R represents any purine, either adanine or guanine. The Kozak sequence encompasses the first codon of a gene sequence, i.e., the ATG in the Kozak sequence should correspond to a start codon. The main method should print out the number of ORFs in the read-in genomic sequence for which there is a corresponding Kozak sequence.

Some codons are used more than others when coding for the same amino acid. For example, both AAA and AAG code for the amino acid lysine, but in yeast genes, lysine is coded by AAA approximately 58% of the time and by AAG approximately 42% of the time. In contrast, in intergenic regions of the yeast genome, AAA accounts for approximately 70% and AAG accounts for approximately 30% of occurrences of either AAA or AAG. The file codonUsage.txt indicates how often various codons are used to code for their corresponding amino acid in yeast protein coding sequences and in yeast non-coding sequences. In other words, nucleotide triplets occur at different rates in protein-coding sequences than in non-coding sequences. The abundance of different tRNA molecules is one factor that can contribute to an organism's bias for or against particular codons in protein-coding sequences. These biases can be used to help identify genes, e.g., we may hypothesize that an ORF with a ratio of AAA to AAG of 58:42 is more likely to correspond to a protein-coding gene than an ORF with a ratio of 70:30. The probabilities in codonUsage.txt can be used to help distinguish coding ORFs from spurious ORFs by comparing the product of coding probabilities for each codon in the ORF to the product of noncoding probabilities for each codon in the ORF. For example, the probability that the ORF ATGCCAATTTAA is coding vs. noncoding based on codon usage can be expressed as:
```
	ProbabilityCoding(ATGCCAATTTAA) = ProbabilityCoding(ATG) * ProbabilityCoding(CCA) *
	                                  ProbabilityCoding(ATT) * ProbabilityCoding(TAA) 
					≈ 1.0 * 0.416 * 0.464 * 0.424
					≈ 0.082
  
```
and
```
	ProbabilityNoncoding(ATGCCAATTTAA) = ProbabilityNoncoding(ATG) * ProbabilityNoncoding(CCA) *
	                                     ProbabilityNoncoding(ATT) * ProbabilityNoncoding(TAA) 
					   ≈ 1.0 * 0.316 * 0.396 * 0.455
					   ≈ 0.057
  
```
In this case, ProbabilityCoding(ATGCCAATTTAA) is larger than ProbabilityNoncoding(ATGCCAATTTAA) so we hypothesize that the ORF ATGCCAATTTAA is more likely to correspond to a coding sequence than a noncoding sequence. Based on the codon usage probabilities found in codonUsage.txt, the main method should print out the number of ORFs in the read-in genomic sequence for which the probability of the ORF corresponding to a coding sequence is greater than the probability of the ORF corresponding to a noncoding sequence.

For comparison, your ORF_Finder program should generate a random DNA nucleotide sequence with the same length and same expected GC content as the read-in genomic sequence. The main method should print out for the random sequence the number of ORFs, the number of ORFs at least 300 nucleotides in length, the number of ORFs that would code for proteins containing less than 4% cysteines and less than 4% tryptophans and more than 6% alanines and more than 6% leucines, the number of ORFs for which there is a corresponding Kozak sequence, and the number of ORFs that are more likely to correspond to a coding sequence than a noncoding sequences based on codon usage statistics.

Finally, ensure your program is able to execute on the genomic sequence from yeast chromosome #7 (as the professor will be testing your program on this file).

As examples, when executing your program with the file shortGenomicSequence.txt as a command line argument, your program might output the following:

[cs313@CS ORF_Finder] java ORF_Finder shortGenomicSequence.txt

**************************************************************
***** Genomic Sequence Read-In From Fasta Formatted File *****
**************************************************************
Number of ORFs is:      3
Number of ORFs at least 300nt is:       0
Number of ORFs with specified amino acid distribution is:       0
Number of ORFs with Kozak sequence is:  1
Number of ORFs with likely protein coding codons is:    1

*****************************************************************************************
***** Random Sequence With Same Length And GC-Content As Sequence Read-In From File *****
*****************************************************************************************
Number of ORFs is:      0
Number of ORFs at least 300nt is:       0
Number of ORFs with specified amino acid distribution is:       0
Number of ORFs with Kozak sequence is:  0
Number of ORFs with likely protein coding codons is:    0

When executing your program with the file yeast7.txt as a command line argument, your program might output the following:

[cs313@CS ORF_Finder] java ORF_Finder yeast7.txt

**************************************************************
***** Genomic Sequence Read-In From Fasta Formatted File *****
**************************************************************
Number of ORFs is:      20285
Number of ORFs at least 300nt is:       2587
Number of ORFs with specified amino acid distribution is:       2188
Number of ORFs with Kozak sequence is:  103
Number of ORFs with likely protein coding codons is:    8670

*****************************************************************************************
***** Random Sequence With Same Length And GC-Content As Sequence Read-In From File *****
*****************************************************************************************
Number of ORFs is:      19815
Number of ORFs at least 300nt is:       37
Number of ORFs with specified amino acid distribution is:       1566
Number of ORFs with Kozak sequence is:  69
Number of ORFs with likely protein coding codons is:    7034

Task 2: Motif Instances

TATA boxes and Kozak sequences are examples of motifs found in genomics sequences. Instances of these motifs in a genomic sequence, e.g., TATAAA or ACCATGG respectively, can serve as signals to a cell during important biological processes such as transcription and translation.

When investigating a gene in a genome and how the gene is regulated, it may be useful to identify instances of various motifs for the gene. However, identifying instances of motifs in a genomic sequence is non-trivial. For example, the TATA box for most eukaryotic genes is composed of the following six nucleotides: TATAAA. The most commonly occurring instance of a motif is called the consensus sequence for the motif. But some genes have degenerate TATA boxes that differ from the consensus sequence, such as TATATA or CATAAA. If we only searched a genomic sequence for the consensus sequence of a motif, we would miss other (degenerate) instances of the motif.

How then might we search for an instance of a gene's TATA box, if the instance might differ from the consensus sequence? One approach would be to search for sequences of six nucleotides either that match the consensus sequence, TATAAA, or that differ from the consensus sequence only by 1 or 2 nucleotides. This approach, however, has limitations. All TATA box instances have a thymine nucleotide as the 3rd of the six nucleotides. The 5th of the six nucleotides is an adenine about two-thirds of the time and is a thymine about one-third of the time. Ideally, our approach for identifying instances of motifs should take into account the fact that some positions in the motif may contain more variability (e.g., position 5) than other positions (e.g., position 3).

A weight-matrix is a means for describing a motif that takes into account the nucleotide variability in each position of the motif. For instance, a weight-matrix for the TATA box is given below.

		 1	 2	 3	 4	 5	 6
	A	 4%	90%	 0%	95%	66%	97%
	C	10%	 1%	 0%	 0%	 1%	 0%
	G	 3%	 1%	 0%	 0%	 1%	 3%
	T	83%	 8%    100%	 5%	32%	 0%

The weight-matrix above reflects the fact that for a large number of well studied TATA box instances, the first nucleotide is an adenine 4% of the time, a cytosine 10% of the time, a guanine 3% of the time, and a thymine 83% of the time. Looking at the weight-matrix above for the TATA box motif, we would conclude that the sequence TATAAA better fits the TATA box motif than does the sequence GGGGGG.

More formally, given a weight-matrix for a motif, we can calculate the probability that a given sequence (of the same length as the weight-matrix) corresponds to the motif. The probability that a sequence corresponds to a motif is the product of the frequency of each nucleotide in the given sequence as determined by the weight-matrix. For instance, for the TATA box weight-matrix above, the probability that the sequence TATAAA corresponds to the TATA box motif is:
ProbabilityTATA(TATAAA) = 0.83 * 0.90 * 1.00 * 0.95 * 0.66 * 0.97 ≈ 0.45

The probability that the sequence GGGGGG corresponds to the TATA box motif is:
ProbabilityTATA(GGGGGG) = 0.03 * 0.01 * 0.00 * 0.00 * 0.01 * 0.03 ≈ 0.0

The probability that the sequence CATTTG corresponds to the TATA box motif is:
ProbabilityTATA(CATTTG) = 0.10 * 0.90 * 1.00 * 0.05 * 0.32 * 0.03 ≈ 4.3 x 10^-5

Note: the SequenceOps class that you developed in Project 1 will be useful in completing this task. Don't be shy about using your previously developed code when programming.

Create a directory named Matrix containing a Java file Matrix.java. Your Matrix class should meet the following specifications.

A genomic sequence should be read in from a Fasta formatted file specified by a single command line argument.

The program should search the read-in genomic sequence and identify the hexamer within the sequence that has the highest probability of being an instance of the TATA box motif (based on the weight-matrix given above). The main method should print out the highest scoring hexamer along with its probability.

When assessing how likely it is that a hexamer corresponds to the TATA box motif, it would be useful to know how likely it is that the hexamer occurs by random chance in the genomic sequence. Consider that, in addition to all of the genuine TATA boxes in a genome, there will be many spurious sequences similar to TATA boxes that occur in a genome by random chance. In particular, in a large genomic sequence with a high concentration of adenine and thymine nucleotides, the hexamer TATAAA may occur by random chance. It stands to reason that, in genomes with low GC content, there will be more spurious sequences that are similar to TATA boxes and, in genomes with high GC content, there will be fewer spurious sequences that are similar to TATA boxes. Furthermore, if we observe a sequence such as TATAAA in a genome with high GC content, we may have more confidence that it corresponds to a TATA box than if we observe the same sequence in a genome with low GC content since the sequence is more likely to occur by chance in the low GC content genome.
Here, you must assess how likely it is that a hexamer corresponds to a TATA box by considering how likely it is that the hexamer occurs by chance. In a genome such as that of yeast, with a GC content of 38%, we can use the following background weight-matrix to determine the probability that a hexamer occurs by chance.
```
		 1	 2	 3	 4	 5	 6
	A	31%	31%	31%	31%	31%	31%
	C	19%	19%	19%	19%	19%	19%
	G	19%	19%	19%	19%	19%	19%
	T	31%	31%	31%	31%	31%	31%
  
```
As examples, the probability that the sequence TATAAA occurs by chance is:
           ProbabilityChance(TATAAA) = 0.31 * 0.31 * 0.31 * 0.31 * 0.31 * 0.31 ≈ 8.9 x 10^-4
The probability that the sequence GGGGGG occurs by chance is:
           ProbabilityChance(GGGGGG) = 0.19 * 0.19 * 0.19 * 0.19 * 0.19 * 0.19 ≈ 4.7 x 10^-5
The probability that the sequence CATTTG occurs by chance is:
           ProbabilityChance(CATTTG) = 0.19 * 0.31 * 0.31 * 0.31 * 0.31 * 0.19 ≈ 3.3 x 10^-4
The likelihood of a hexamer, X, corresponding to a TATA box, then, can be described as the probability of the hexamer corresponding to the TATA box weight-matrix divided by the probability of the hexamer corresponding to the background weight-matrix.
           Likelihood = ProbabilityTATA(X) / ProbabilityChance(X)
If the hexamer better fits the TATA box weight-matrix than the background weight-matrix, then the numerator will be larger than the denominator and the likelihood will be greater than 1.0. If the hexamer is more likely to occur by chance than to correspond to a TATA box, then denominator will be larger than the numerator and the likelihood will be less than 1.0.

Finally, the log likelihood of a hexamer, X, corresponding to a TATA box is the natural logarithm of the likelihood of X corresponding to a TATA box.
           LogLikelihood = ln(ProbabilityTATA(X) / ProbabilityChance(X))
If the hexamer better fits the TATA box weight-matrix than the background weight-matrix, then the log likelihood will be larger than 0.0. If the hexamer is more likely to occur by chance than to correspond to a TATA box, then the log likelihood will be less than 0.0.

The program should search the read-in genomic sequence and print out every hexamer in the sequence that is more likely to correspond to a TATA box instance than to occur by chance (based on the two weight-matrices above). In addition to printing out each such hexamer, the log likelihood of each such hexamer should be printed out.

For example, when executing your program with the file sequence.txt as a command line argument, your program might output the following:

[cs313@CS Matrix] java Matrix sequence.txt


Best hexamer is TATATA with probability 0.22027536

Hexamer GATAAA has log ratio 3.4074597555246435
Hexamer GTTAAA has log ratio 0.9870916268742144
Hexamer GATAAA has log ratio 3.4074597555246435
Hexamer TGTAAA has log ratio 2.2278784043228668
Hexamer TATATA has log ratio 5.514221010107727
Hexamer TTTAGA has log ratio 0.11766520397627707