Reading
How to turn in this Project
You are required to turn in both a hardcopy and a softcopy. If working as a team, the team should submit a single hardcopy and a single softcopy (to either team member's account). Please make sure to keep a copy of your work, on your computer, in your private directory (or, to play it safe, both).Hardcopy Submission
Your hardcopy packet should consist of:- The cover page;
- Your newly developed
SequenceOps.java
file from Task 1.
Softcopy Submission
You should submit your final version of your
SequenceOps
directory to the
drop/project1
directory in your account
on the CS
server.
Help! One of your friends is in despair. She is overwhelmed with course work this week. She jokes, "if only you could write a computer program for me that critically analyzes Shakespeare plays and generates an 8-page paper filled with penetrating insights based on the results!" You sympathize with her situation. You tell her that you won't be learning how to write such a program until your Computer Science course next semester. However, you suggest that maybe you can help her in another way. You tell her that you can create a software tool that will enable her to complete her genomics homework more efficiently...
Task 1: Genomic Sequence Operations
In this problem, you will implement a number of operations on
genomic sequences. To begin, create a new Java class called
SequenceOps
. Your SequenceOps
class must
implement the 11 class methods specified below. You are free
to include any additional helper methods or variables that you deem
appropriate. You can experiment with a sample solution to each
of the SequenceOps
methods by downloading the
SequenceOps_Sols.class
file from the download
directory on the CS
server. This is a compiled class
file rather than a java
file, so you won't be able to open
the file and see the solution code. You should include this
SequenceOps_Sols.class
file in the same folder as your
java
files (or the same folder as your class
files depending on which environment you use to develop your code, and you
may need to add the file to your project
, again depending
on which environment you use to develop your code). You can then invoke the
sample solution methods, e.g., SequenceOps_Sols.reverse("CGTCAGGCA")
or
SequenceOps_Sols.GC_content("ACGGACTGC")
, from within your
own SequenceOps
Java file.
Note: the efficiency of your implementations matters! While we
are not concerned about constant time differences in runtimes
(if one implementation is twice as fast as another), we are
concerned about asymptotic efficiency as inputs become large.
For instance, you should not
implement a method whose running time is quadratic in the size
of the input if there is another way to implement the method
whose running time is linear in the size of the input.
Inefficient implementations are most likely to occur when
Strings
are used, as opposed to
StringBuilders
! Since Strings
are immutable, each time you add to a String
,
the String
must be copied over. In contrast,
you can add to a StringBuilder
without
copying it over. As a general guide, whenever possible use
StringBuilders
for your sequence operations
rather than Strings
. If a method specifies that
you return a String
, you can easily convert
between StringBuilders
and Strings
.
public static String sequenceFromFastaFile(String fileName)
Returns a String corresponding to the genomic sequence found in the specified FASTA file.public static String sequenceToFastaFormat(String s)
Takes the specified genomic sequences
and returns a String representing the sequence in FASTA format. Each line in the returned String should be 60 characters in length, except possibly the last line. The returned String need not include the FASTA header line beginning with the character '>'.public static String reverse(String s)
Returns a String corresponding to the reversed version of the specified sequences
. For example, ifs
represents the sequenceACGGACTGC
then the method should return the stringCGTCAGGCA
.public static String complement(String s)
Returns a String corresponding to the complement of the specified nucleotide sequences
. For example, ifs
represents the sequenceACGGACTGC
then the method should return the stringTGCCTGACG
.public static String reverseComplement(String s)
Returns a String corresponding to the reverse complement of the specified nucleotide sequences
. For example, ifs
represents the sequenceACGGACTGC
then the method should return the stringGCAGTCCGT
.public static double GC_content(String s)
Returns the GC content of the specified nucleotide sequences
. For example, ifs
represents the sequenceACGGACTGC
then the method should return0.6666666666666666
.public static String randomPermutation(String s)
Returns a random permutation of the specified sequences
. The returned String should have all the same characters as the input Strings
, only the order of the characters should be determined randomly. Each invocation of the method on a particular Strings
should result in a (almost certainly) different permutation ofs
.public static String randomSequence(int length, double GC_content)
Returns a random genomic sequence of the specified length with the expected specified GC content. Each nucleotide in the random sequence can be generated independently. The probability that each nucleotide is aG
orC
(as opposed toA
orT
) is specified by theGC_content
parameter. Each nucleotide has an equal probability of beingA
orT
. Each nucleotide has an equal probability of beingG
orC
. Note: each invocation of therandomSequence
method may not generate a random sequence with a GC content equal to that specified by theGC_content
parameter. However, the expected GC content will be equal to theGC_content
parameter, i.e., if you invoked the method infinitely many times then the GC content of all randomly generated sequences should equal theGC_content
parameter.public static String randomSampling(String s)
Returns a random genomic sequence with the same length ass
and the same expected GC content ass
. Note: each invocation of therandomSampling
method may not generate a random sequence with a GC content equal to that ofs
. However, the expected GC content will be equal to that ofs
, i.e., if you invoked the method infinitely many times then the GC content of all randomly generated sequences should equal that ofs
.public static char translateCodon(String s)
Assuming the input sequences
is a genomic sequence of exactly 3 nucleotides, the method translates the codon, i.e., it returns a character representing the amino acid corresponding to the codon. For example, if the input sequences
isCUA
then the method should return the characterL
corresponding to the amino acid Leucine. A table of codon translations is available in HTML format and in tab-delimited text format. Note: the expected running time of this method should not be linear in the size of the table, i.e., the method should not compares
to each of the 64 codons. Instead, the expected running time of the method should be constant. Consider what data structures would be appropriate to store the table in order to support a constant expected running time of this method.public static String translateORF(String s)
Assuming the length of the input sequences
is divisible by 3, the method returns a translated version of the sequence, i.e., every 3 nucleotides ins
are translated into an amino acid, and the sequence of amino acids is returned. For example, ifs
represents the sequenceAUGGCCUUUCGAUAG
then the method should return the stringMAFR*
.
Download the file yeastGenome.txt. This file contains the entire yeast genome in FASTA format. Execute your
GC_content
method on the yeast genome and verify that the
GC content of the yeast genome is 38%. In a comment above your
GC_content
method, indicate the GC content of the
Escherichia coli genome and the GC content of
human chromosome 22.