How to turn in this Project
You are required to turn in both a hardcopy and a softcopy.
If working as a team, the team should submit a single hardcopy
and a single softcopy (to either team member's account).
Please make sure to keep a copy of your work, on your computer, in
your private directory (or, to play it safe, both).
Hardcopy Submission
Your hardcopy packet should consist of:
- The cover page;
- Your newly developed
ORF_Finder.java
file from Task 1.
- Your newly developed
Matrix.java
file from Task 2.
Staple these together and submit them on the due date.
Softcopy Submission
You should submit your final version of your
ORF_Finder
directory and your
Matrix
directory to the drop/project2
directory in your account on the CS server.
Last week, you helped your friend out by creating a software tool
that facilitated completion of her genomics homework. This week,
it seems your friend's entire biology class has come by looking
for your help to automate some of their data analysis. As part
of a class project, they are analyzing data from a newly sequenced
genome. They are trying to identify novel ORFs and motifs in the
new genomic sequence. Don't they know you have your own work to do.
Ah, the curse of competence! Well, you say to yourself,
the cat's out of the bag. Then you think, did I really just say
that to myself? You make a mental note to cut back on the kitten
videos on YouTube.
Task 1: Finding ORFs
In this task, you will search for potential genes in the DNA from
yeast chromosome #7. This chromosome contains 527 putative protein-coding
genes. Almost all of these genes correspond to an ORF (open reading frame).
An ORF is a genomic sequence beginning with a start codon and
containing only one in-frame stop codon, which occurs at the end of the
sequence. A stop codon is in-frame if it is a multiple of three
nucleotides downstream of the start codon. For example, the sequence
ATGCATCGTAGCTAG
is an ORF because it begins with
ATG
and ends with an in-frame stop codon, TAG
,
with no other in-frame stop codons appearing in the sequence. In
contrast, the sequence ATGGATCTAG
is not an ORF because
the stop codon TAG
is not in-frame with the start codon.
Similarly, the sequence ATGACCTAGGGTTAG
is not an ORF
(though it contains an ORF within the sequence) because there is an
in-frame stop codon TAG
occurring before the end of the
sequence. It is also important to note that ORFs can occur within other ORFs.
For example, the sequence ATGCATGCTAGCATGCTAGCATGCTAG
contains three ORFs: ATGCATGCTAGCATGCTAGCATGCTAG
,
ATGCTAGCATGCTAG
, and ATGCTAGCATGCTAG
.
When searching for genes in a genomic sequence, it is useful
to identify ORFs because, although there are many more ORFs than genes
in genomes, most genes correspond to an ORF.
In this task, you will identify ORFs that evince properties common to
known yeast genes. To begin, download the
genomic sequence for yeast
chromosome #7 in Fasta format.
Note: the SequenceOps
class that you developed in
Project 1 will be useful in completing this task. Don't be
shy about using your previously developed code when
programming.
Create a directory named ORF_Finder
containing
a Java file ORF_Finder.java
. Your ORF_Finder
class should meet the following specifications.
- A genomic sequence should be read in from a Fasta formatted file
(e.g.,
yeast7.txt
) specified by a command line
argument.
- The
main
method should print out the number of ORFs in the
read-in genomic sequence. (Note: it is ok to compute the number of ORFs
using a quadratic time algorithm, O(n^2), but if you really want to impress your
professor, see if you can come up with a linear time algorithm, O(n).)
- Most yeast proteins are at least 100 amino acids in length and, thus,
most yeast genes are at least 300 nucleotides in length.
The
main
method should print out the number of ORFs in the
read-in genomic sequence that are at least 300 nucleotides in length.
- Cysteine and tryptophan are amino acids that occur infrequently in most
yeast proteins whereas alanine and leucine are amino acids that occur
more frequently in most yeast proteins.
The
main
method should print out the number of ORFs in the
read-in genomic sequence that would code for proteins containing less
than 4% cysteines and less than 4% tryptophans and more than 6%
alanines and more than 6% leucines.
- A Kozak sequence is a sequence of nucleotides surrounding a start codon
that facilitates ribosome binding. A Kozak sequence can be expressed as
RCCATGG
, where the character R
represents
any purine, either adanine or guanine. The Kozak sequence encompasses
the first codon of a gene sequence, i.e., the ATG
in the
Kozak sequence should correspond to a start codon.
The main
method should print out the number of ORFs in the
read-in genomic sequence for which there is a corresponding
Kozak sequence.
- Some codons are used more than others when coding for the same amino
acid. For example, both
AAA
and AAG
code for
the amino acid lysine, but in yeast genes, lysine is coded by AAA
approximately 58% of the time and by AAG
approximately 42% of
the time. In contrast, in intergenic regions of the yeast genome,
AAA
accounts for approximately 70% and AAG
accounts
for approximately 30% of occurrences of either AAA
or
AAG
.
The file codonUsage.txt indicates how often
various codons are used to code for their corresponding amino acid in yeast
protein coding sequences and in yeast non-coding sequences.
In other words, nucleotide triplets occur at different rates
in protein-coding sequences than in non-coding sequences. The abundance of
different tRNA molecules is one factor that can contribute to an organism's
bias for or against particular codons in protein-coding sequences. These biases
can be used to help identify genes, e.g., we may hypothesize that an ORF
with a ratio of AAA
to AAG
of 58:42 is more likely
to correspond to a protein-coding gene than an ORF with a ratio of 70:30.
The probabilities
in codonUsage.txt can be used to help
distinguish coding ORFs from spurious ORFs by comparing the product of
coding probabilities for each codon in the ORF to the product of
noncoding probabilities for each codon in the ORF. For example, the probability
that the ORF ATGCCAATTTAA
is coding vs. noncoding based on codon usage
can be expressed as:
ProbabilityCoding(ATGCCAATTTAA) = ProbabilityCoding(ATG) * ProbabilityCoding(CCA) *
ProbabilityCoding(ATT) * ProbabilityCoding(TAA)
≈ 1.0 * 0.416 * 0.464 * 0.424
≈ 0.082
and
ProbabilityNoncoding(ATGCCAATTTAA) = ProbabilityNoncoding(ATG) * ProbabilityNoncoding(CCA) *
ProbabilityNoncoding(ATT) * ProbabilityNoncoding(TAA)
≈ 1.0 * 0.316 * 0.396 * 0.455
≈ 0.057
In this case, ProbabilityCoding(ATGCCAATTTAA)
is larger than
ProbabilityNoncoding(ATGCCAATTTAA)
so we hypothesize that the ORF
ATGCCAATTTAA
is more likely to correspond to a coding sequence than
a noncoding sequence. Based on the codon usage probabilities found in
codonUsage.txt,
the main
method should print out the number of ORFs in the
read-in genomic sequence for which the probability of the ORF corresponding to
a coding sequence is greater than the probability of the ORF corresponding to a
noncoding sequence.
- For comparison, your ORF_Finder program should generate a random
DNA nucleotide sequence with the same length and same expected GC
content as the read-in genomic sequence.
The
main
method should print out for the random sequence
the number of ORFs, the number of ORFs at least 300 nucleotides in length,
the number of ORFs that would code for proteins containing less than
4% cysteines and less than 4% tryptophans and more than 6% alanines and
more than 6% leucines, the number of ORFs for which there is a
corresponding Kozak sequence, and the number of ORFs that are more
likely to correspond to a coding sequence than a noncoding sequences based
on codon usage statistics.
- Finally, ensure your program is able to execute on the genomic sequence
from yeast chromosome #7 (as the professor
will be testing your program on this file).
As examples, when executing your program with the file
shortGenomicSequence.txt
as a command line argument, your program might output the following:
[cs313@CS ORF_Finder] java ORF_Finder shortGenomicSequence.txt
**************************************************************
***** Genomic Sequence Read-In From Fasta Formatted File *****
**************************************************************
Number of ORFs is: 3
Number of ORFs at least 300nt is: 0
Number of ORFs with specified amino acid distribution is: 0
Number of ORFs with Kozak sequence is: 1
Number of ORFs with likely protein coding codons is: 1
*****************************************************************************************
***** Random Sequence With Same Length And GC-Content As Sequence Read-In From File *****
*****************************************************************************************
Number of ORFs is: 0
Number of ORFs at least 300nt is: 0
Number of ORFs with specified amino acid distribution is: 0
Number of ORFs with Kozak sequence is: 0
Number of ORFs with likely protein coding codons is: 0
|
When executing your program with the file
yeast7.txt
as a command line argument, your program might output the following:
[cs313@CS ORF_Finder] java ORF_Finder yeast7.txt
**************************************************************
***** Genomic Sequence Read-In From Fasta Formatted File *****
**************************************************************
Number of ORFs is: 20285
Number of ORFs at least 300nt is: 2587
Number of ORFs with specified amino acid distribution is: 2188
Number of ORFs with Kozak sequence is: 103
Number of ORFs with likely protein coding codons is: 8670
*****************************************************************************************
***** Random Sequence With Same Length And GC-Content As Sequence Read-In From File *****
*****************************************************************************************
Number of ORFs is: 19815
Number of ORFs at least 300nt is: 37
Number of ORFs with specified amino acid distribution is: 1566
Number of ORFs with Kozak sequence is: 69
Number of ORFs with likely protein coding codons is: 7034
|
Task 2: Motif Instances
TATA boxes and Kozak sequences are examples of motifs found in
genomics sequences. Instances of these motifs in a genomic sequence,
e.g., TATAAA
or ACCATGG
respectively, can
serve as signals to a cell during important biological processes
such as transcription and translation.
When investigating a gene in a genome and how the gene is regulated,
it may be useful to identify instances of various motifs for the
gene. However, identifying instances of motifs in a genomic sequence
is non-trivial. For example, the TATA box for most eukaryotic genes
is composed of the following six nucleotides: TATAAA
.
The most commonly occurring instance of a motif is called the consensus
sequence for the motif. But some genes have degenerate TATA boxes
that differ from the consensus sequence, such as TATATA
or CATAAA
. If we only searched a genomic sequence for the
consensus sequence of a motif, we would miss other (degenerate)
instances of the motif.
How then might we search for an instance of a gene's TATA box,
if the instance might differ from the consensus sequence? One approach
would be to search for sequences of six nucleotides either that match
the consensus sequence, TATAAA
, or that differ from the consensus
sequence only by 1 or 2 nucleotides. This approach, however, has
limitations. All TATA box instances have a thymine nucleotide as the
3rd of the six nucleotides. The 5th of the six nucleotides is an
adenine about two-thirds of the time and is a thymine about one-third
of the time. Ideally, our approach for identifying instances of motifs
should take into account the fact that some positions in the motif may
contain more variability (e.g., position 5) than other positions
(e.g., position 3).
A weight-matrix is a means for describing a motif that takes into
account the nucleotide variability in each position of the motif.
For instance, a weight-matrix for the TATA box is given below.
1 2 3 4 5 6
A 4% 90% 0% 95% 66% 97%
C 10% 1% 0% 0% 1% 0%
G 3% 1% 0% 0% 1% 3%
T 83% 8% 100% 5% 32% 0%
The weight-matrix above reflects the fact that for a large number
of well studied TATA box instances, the first nucleotide is an
adenine 4% of the time, a cytosine 10% of the time, a guanine 3% of
the time, and a thymine 83% of the time. Looking at the weight-matrix
above for the TATA box motif, we would conclude that the sequence
TATAAA
better fits the TATA box motif than does
the sequence GGGGGG
.
More formally, given a weight-matrix for a motif, we can calculate
the probability that a given sequence (of the same length as the
weight-matrix) corresponds to the motif. The probability that a
sequence corresponds to a motif is the product of the frequency of
each nucleotide in the given sequence as determined by the weight-matrix.
For instance, for the TATA box weight-matrix above, the probability that
the sequence TATAAA
corresponds to the TATA box motif is:
ProbabilityTATA(TATAAA
) = 0.83 * 0.90 * 1.00 * 0.95 * 0.66 * 0.97 ≈ 0.45
The probability that the sequence GGGGGG
corresponds to the TATA box motif is:
ProbabilityTATA(GGGGGG
) = 0.03 * 0.01 * 0.00 * 0.00 * 0.01 * 0.03 ≈ 0.0
The probability that the sequence CATTTG
corresponds to the TATA box motif is:
ProbabilityTATA(CATTTG
) = 0.10 * 0.90 * 1.00 * 0.05 * 0.32 * 0.03 ≈ 4.3 x 10-5
Note: the SequenceOps
class that you developed in
Project 1 will be useful in completing this task. Don't be
shy about using your previously developed code when
programming.
Create a directory named Matrix
containing
a Java file Matrix.java
. Your Matrix
class should meet the following specifications.
- A genomic sequence should be read in from a Fasta formatted file
specified by a single command line argument.
- The program should search the read-in genomic sequence and identify
the hexamer within the sequence that has
the highest probability of being an instance of the
TATA box motif (based on the weight-matrix given above). The
main
method should print out the highest scoring
hexamer along with its probability.
- When assessing how likely it is that a hexamer corresponds to
the TATA box motif, it would be useful to know how likely it is that
the hexamer occurs by random chance in the genomic sequence. Consider
that, in addition to all of the genuine TATA boxes in a genome, there
will be many spurious sequences similar to TATA boxes that occur in a
genome by random chance. In particular, in a large genomic sequence
with a high concentration of adenine and thymine nucleotides, the
hexamer
TATAAA
may occur by random chance. It stands to
reason that, in genomes with low GC content, there will be more
spurious sequences that are similar to TATA boxes and, in genomes
with high GC content, there will be fewer spurious sequences that
are similar to TATA boxes. Furthermore, if we observe a sequence such
as TATAAA
in a genome with high GC content, we may
have more confidence that it corresponds to a TATA box than if we
observe the same sequence in a genome with low GC content since
the sequence is more likely to occur by chance in the low GC
content genome.
Here, you must assess how likely it is that a hexamer corresponds
to a TATA box by considering how likely it is that the hexamer
occurs by chance. In a genome such as that of yeast, with a GC
content of 38%, we can use the following background weight-matrix
to determine the probability that a hexamer occurs by chance.
1 2 3 4 5 6
A 31% 31% 31% 31% 31% 31%
C 19% 19% 19% 19% 19% 19%
G 19% 19% 19% 19% 19% 19%
T 31% 31% 31% 31% 31% 31%
As examples, the probability that the sequence TATAAA
occurs by chance is:
ProbabilityChance(TATAAA
) = 0.31 * 0.31 * 0.31 * 0.31 * 0.31 * 0.31 ≈ 8.9 x 10-4
The probability that the sequence GGGGGG
occurs by chance is:
ProbabilityChance(GGGGGG
) = 0.19 * 0.19 * 0.19 * 0.19 * 0.19 * 0.19 ≈ 4.7 x 10-5
The probability that the sequence CATTTG
occurs by chance is:
ProbabilityChance(CATTTG
) = 0.19 * 0.31 * 0.31 * 0.31 * 0.31 * 0.19 ≈ 3.3 x 10-4
The likelihood of a hexamer, X, corresponding to a TATA
box, then, can be described as the probability of the hexamer
corresponding to the TATA box weight-matrix divided by the probability
of the hexamer corresponding to the background weight-matrix.
Likelihood = ProbabilityTATA(X) / ProbabilityChance(X)
If the hexamer better fits the TATA box weight-matrix than the
background weight-matrix, then the numerator will be larger than the
denominator and the likelihood will be greater than 1.0. If the hexamer
is more likely to occur by chance than to correspond to a TATA box,
then denominator will be larger than the numerator and the likelihood
will be less than 1.0.
Finally, the log likelihood of a hexamer, X,
corresponding to a TATA box is the natural logarithm of the
likelihood of X corresponding to a TATA box.
LogLikelihood = ln(ProbabilityTATA(X) / ProbabilityChance(X))
If the hexamer better fits the TATA box weight-matrix than the
background weight-matrix, then the log likelihood will be larger than 0.0.
If the hexamer is more likely to occur by chance than to correspond to a TATA box,
then the log likelihood will be less than 0.0.
The program should search the read-in genomic sequence and print
out every hexamer in the sequence that is more likely to correspond
to a TATA box instance than to occur by chance (based on the two
weight-matrices above). In addition to printing out each such hexamer,
the log likelihood of each such hexamer should be printed out.
For example, when executing your program with the file
sequence.txt
as a command line argument, your program might output the following:
[cs313@CS Matrix] java Matrix sequence.txt
Best hexamer is TATATA with probability 0.22027536
Hexamer GATAAA has log ratio 3.4074597555246435
Hexamer GTTAAA has log ratio 0.9870916268742144
Hexamer GATAAA has log ratio 3.4074597555246435
Hexamer TGTAAA has log ratio 2.2278784043228668
Hexamer TATATA has log ratio 5.514221010107727
Hexamer TTTAGA has log ratio 0.11766520397627707
|