CS313

Project 2

How to turn in this Project

You are required to turn in both a hardcopy and a softcopy. If working as a team, the team should submit a single hardcopy and a single softcopy (to either team member's account). Please make sure to keep a copy of your work, on your computer, in your private directory (or, to play it safe, both).

Hardcopy Submission

Your hardcopy packet should consist of:
  1. The cover page;
  2. Your newly developed ORF_Finder.java file from Task 1.
  3. Your newly developed Matrix.java file from Task 2.
Staple these together and submit them on the due date.

Softcopy Submission

You should submit your final version of your ORF_Finder directory and your Matrix directory to the drop/project2 directory in your account on the CS server.


Last week, you helped your friend out by creating a software tool that facilitated completion of her genomics homework. This week, it seems your friend's entire biology class has come by looking for your help to automate some of their data analysis. As part of a class project, they are analyzing data from a newly sequenced genome. They are trying to identify novel ORFs and motifs in the new genomic sequence. Don't they know you have your own work to do. Ah, the curse of competence! Well, you say to yourself, the cat's out of the bag. Then you think, did I really just say that to myself? You make a mental note to cut back on the kitten videos on YouTube.


Task 1: Finding ORFs

In this task, you will search for potential genes in the DNA from yeast chromosome #7. This chromosome contains 527 putative protein-coding genes. Almost all of these genes correspond to an ORF (open reading frame). An ORF is a genomic sequence beginning with a start codon and containing only one in-frame stop codon, which occurs at the end of the sequence. A stop codon is in-frame if it is a multiple of three nucleotides downstream of the start codon. For example, the sequence ATGCATCGTAGCTAG is an ORF because it begins with ATG and ends with an in-frame stop codon, TAG, with no other in-frame stop codons appearing in the sequence. In contrast, the sequence ATGGATCTAG is not an ORF because the stop codon TAG is not in-frame with the start codon. Similarly, the sequence ATGACCTAGGGTTAG is not an ORF (though it contains an ORF within the sequence) because there is an in-frame stop codon TAG occurring before the end of the sequence. It is also important to note that ORFs can occur within other ORFs. For example, the sequence ATGCATGCTAGCATGCTAGCATGCTAG contains three ORFs: ATGCATGCTAGCATGCTAGCATGCTAG, ATGCTAGCATGCTAG, and ATGCTAGCATGCTAG. When searching for genes in a genomic sequence, it is useful to identify ORFs because, although there are many more ORFs than genes in genomes, most genes correspond to an ORF.

In this task, you will identify ORFs that evince properties common to known yeast genes. To begin, download the genomic sequence for yeast chromosome #7 in Fasta format.

Note: the SequenceOps class that you developed in Project 1 will be useful in completing this task. Don't be shy about using your previously developed code when programming.

Create a directory named ORF_Finder containing a Java file ORF_Finder.java. Your ORF_Finder class should meet the following specifications.

As examples, when executing your program with the file shortGenomicSequence.txt as a command line argument, your program might output the following:

[cs313@CS ORF_Finder] java ORF_Finder shortGenomicSequence.txt

**************************************************************
***** Genomic Sequence Read-In From Fasta Formatted File *****
**************************************************************
Number of ORFs is:      3
Number of ORFs at least 300nt is:       0
Number of ORFs with specified amino acid distribution is:       0
Number of ORFs with Kozak sequence is:  1
Number of ORFs with likely protein coding codons is:    1

*****************************************************************************************
***** Random Sequence With Same Length And GC-Content As Sequence Read-In From File *****
*****************************************************************************************
Number of ORFs is:      0
Number of ORFs at least 300nt is:       0
Number of ORFs with specified amino acid distribution is:       0
Number of ORFs with Kozak sequence is:  0
Number of ORFs with likely protein coding codons is:    0

When executing your program with the file yeast7.txt as a command line argument, your program might output the following:

[cs313@CS ORF_Finder] java ORF_Finder yeast7.txt

**************************************************************
***** Genomic Sequence Read-In From Fasta Formatted File *****
**************************************************************
Number of ORFs is:      20285
Number of ORFs at least 300nt is:       2587
Number of ORFs with specified amino acid distribution is:       2188
Number of ORFs with Kozak sequence is:  103
Number of ORFs with likely protein coding codons is:    8670

*****************************************************************************************
***** Random Sequence With Same Length And GC-Content As Sequence Read-In From File *****
*****************************************************************************************
Number of ORFs is:      19815
Number of ORFs at least 300nt is:       37
Number of ORFs with specified amino acid distribution is:       1566
Number of ORFs with Kozak sequence is:  69
Number of ORFs with likely protein coding codons is:    7034

Task 2: Motif Instances

TATA boxes and Kozak sequences are examples of motifs found in genomics sequences. Instances of these motifs in a genomic sequence, e.g., TATAAA or ACCATGG respectively, can serve as signals to a cell during important biological processes such as transcription and translation.

When investigating a gene in a genome and how the gene is regulated, it may be useful to identify instances of various motifs for the gene. However, identifying instances of motifs in a genomic sequence is non-trivial. For example, the TATA box for most eukaryotic genes is composed of the following six nucleotides: TATAAA. The most commonly occurring instance of a motif is called the consensus sequence for the motif. But some genes have degenerate TATA boxes that differ from the consensus sequence, such as TATATA or CATAAA. If we only searched a genomic sequence for the consensus sequence of a motif, we would miss other (degenerate) instances of the motif.

How then might we search for an instance of a gene's TATA box, if the instance might differ from the consensus sequence? One approach would be to search for sequences of six nucleotides either that match the consensus sequence, TATAAA, or that differ from the consensus sequence only by 1 or 2 nucleotides. This approach, however, has limitations. All TATA box instances have a thymine nucleotide as the 3rd of the six nucleotides. The 5th of the six nucleotides is an adenine about two-thirds of the time and is a thymine about one-third of the time. Ideally, our approach for identifying instances of motifs should take into account the fact that some positions in the motif may contain more variability (e.g., position 5) than other positions (e.g., position 3).

A weight-matrix is a means for describing a motif that takes into account the nucleotide variability in each position of the motif. For instance, a weight-matrix for the TATA box is given below.

		 1	 2	 3	 4	 5	 6
	A	 4%	90%	 0%	95%	66%	97%
	C	10%	 1%	 0%	 0%	 1%	 0%
	G	 3%	 1%	 0%	 0%	 1%	 3%
	T	83%	 8%    100%	 5%	32%	 0%

The weight-matrix above reflects the fact that for a large number of well studied TATA box instances, the first nucleotide is an adenine 4% of the time, a cytosine 10% of the time, a guanine 3% of the time, and a thymine 83% of the time. Looking at the weight-matrix above for the TATA box motif, we would conclude that the sequence TATAAA better fits the TATA box motif than does the sequence GGGGGG.

More formally, given a weight-matrix for a motif, we can calculate the probability that a given sequence (of the same length as the weight-matrix) corresponds to the motif. The probability that a sequence corresponds to a motif is the product of the frequency of each nucleotide in the given sequence as determined by the weight-matrix. For instance, for the TATA box weight-matrix above, the probability that the sequence TATAAA corresponds to the TATA box motif is:
           ProbabilityTATA(TATAAA) = 0.83 * 0.90 * 1.00 * 0.95 * 0.66 * 0.97 ≈ 0.45

The probability that the sequence GGGGGG corresponds to the TATA box motif is:
           ProbabilityTATA(GGGGGG) = 0.03 * 0.01 * 0.00 * 0.00 * 0.01 * 0.03 ≈ 0.0

The probability that the sequence CATTTG corresponds to the TATA box motif is:
           ProbabilityTATA(CATTTG) = 0.10 * 0.90 * 1.00 * 0.05 * 0.32 * 0.03 ≈ 4.3 x 10-5

Note: the SequenceOps class that you developed in Project 1 will be useful in completing this task. Don't be shy about using your previously developed code when programming.

Create a directory named Matrix containing a Java file Matrix.java. Your Matrix class should meet the following specifications.

For example, when executing your program with the file sequence.txt as a command line argument, your program might output the following:

[cs313@CS Matrix] java Matrix sequence.txt


Best hexamer is TATATA with probability 0.22027536

Hexamer GATAAA has log ratio 3.4074597555246435
Hexamer GTTAAA has log ratio 0.9870916268742144
Hexamer GATAAA has log ratio 3.4074597555246435
Hexamer TGTAAA has log ratio 2.2278784043228668
Hexamer TATATA has log ratio 5.514221010107727
Hexamer TTTAGA has log ratio 0.11766520397627707