CS313

Project 1

Reading

How to turn in this Project

You are required to turn in both a hardcopy and a softcopy. If working as a team, the team should submit a single hardcopy and a single softcopy (to either team member's account). Please make sure to keep a copy of your work, on your computer, in your private directory (or, to play it safe, both).

Hardcopy Submission

Your hardcopy packet should consist of:
  1. The cover page;
  2. Your newly developed SequenceOps.java file from Task 1.
Staple these together and submit them on the due date.

Softcopy Submission

You should submit your final version of your SequenceOps directory to the drop/project1 directory in your account on the CS server.


Help! One of your friends is in despair. She is overwhelmed with course work this week. She jokes, "if only you could write a computer program for me that critically analyzes Shakespeare plays and generates an 8-page paper filled with penetrating insights based on the results!" You sympathize with her situation. You tell her that you won't be learning how to write such a program until your Computer Science course next semester. However, you suggest that maybe you can help her in another way. You tell her that you can create a software tool that will enable her to complete her genomics homework more efficiently...


Task 1: Genomic Sequence Operations

In this problem, you will implement a number of operations on genomic sequences. To begin, create a new Java class called SequenceOps. Your SequenceOps class must implement the 11 class methods specified below. You are free to include any additional helper methods or variables that you deem appropriate. You can experiment with a sample solution to each of the SequenceOps methods by downloading the SequenceOps_Sols.class file from the download directory on the CS server. This is a compiled class file rather than a java file, so you won't be able to open the file and see the solution code. You should include this SequenceOps_Sols.class file in the same folder as your java files (or the same folder as your class files depending on which environment you use to develop your code, and you may need to add the file to your project, again depending on which environment you use to develop your code). You can then invoke the sample solution methods, e.g., SequenceOps_Sols.reverse("CGTCAGGCA") or SequenceOps_Sols.GC_content("ACGGACTGC"), from within your own SequenceOps Java file.

Note: the efficiency of your implementations matters! While we are not concerned about constant time differences in runtimes (if one implementation is twice as fast as another), we are concerned about asymptotic efficiency as inputs become large. For instance, you should not implement a method whose running time is quadratic in the size of the input if there is another way to implement the method whose running time is linear in the size of the input. Inefficient implementations are most likely to occur when Strings are used, as opposed to StringBuilders! Since Strings are immutable, each time you add to a String, the String must be copied over. In contrast, you can add to a StringBuilder without copying it over. As a general guide, whenever possible use StringBuilders for your sequence operations rather than Strings. If a method specifies that you return a String, you can easily convert between StringBuilders and Strings.

  1. public static String sequenceFromFastaFile(String fileName)
    Returns a String corresponding to the genomic sequence found in the specified FASTA file.

  2. public static String sequenceToFastaFormat(String s)
    Takes the specified genomic sequence s and returns a String representing the sequence in FASTA format. Each line in the returned String should be 60 characters in length, except possibly the last line. The returned String need not include the FASTA header line beginning with the character '>'.

  3. public static String reverse(String s)
    Returns a String corresponding to the reversed version of the specified sequence s. For example, if s represents the sequence ACGGACTGC then the method should return the string CGTCAGGCA.

  4. public static String complement(String s)
    Returns a String corresponding to the complement of the specified nucleotide sequence s. For example, if s represents the sequence ACGGACTGC then the method should return the string TGCCTGACG.

  5. public static String reverseComplement(String s)
    Returns a String corresponding to the reverse complement of the specified nucleotide sequence s. For example, if s represents the sequence ACGGACTGC then the method should return the string GCAGTCCGT.

  6. public static double GC_content(String s)
    Returns the GC content of the specified nucleotide sequence s. For example, if s represents the sequence ACGGACTGC then the method should return 0.6666666666666666.

  7. Download the file yeastGenome.txt. This file contains the entire yeast genome in FASTA format. Execute your GC_content method on the yeast genome and verify that the GC content of the yeast genome is 38%. In a comment above your GC_content method, indicate the GC content of the Escherichia coli genome and the GC content of human chromosome 22.

  8. public static String randomPermutation(String s)
    Returns a random permutation of the specified sequence s. The returned String should have all the same characters as the input String s, only the order of the characters should be determined randomly. Each invocation of the method on a particular String s should result in a (almost certainly) different permutation of s.

  9. public static String randomSequence(int length, double GC_content)
    Returns a random genomic sequence of the specified length with the expected specified GC content. Each nucleotide in the random sequence can be generated independently. The probability that each nucleotide is a G or C (as opposed to A or T) is specified by the GC_content parameter. Each nucleotide has an equal probability of being A or T. Each nucleotide has an equal probability of being G or C. Note: each invocation of the randomSequence method may not generate a random sequence with a GC content equal to that specified by the GC_content parameter. However, the expected GC content will be equal to the GC_content parameter, i.e., if you invoked the method infinitely many times then the GC content of all randomly generated sequences should equal the GC_content parameter.

  10. public static String randomSampling(String s)
    Returns a random genomic sequence with the same length as s and the same expected GC content as s. Note: each invocation of the randomSampling method may not generate a random sequence with a GC content equal to that of s. However, the expected GC content will be equal to that of s, i.e., if you invoked the method infinitely many times then the GC content of all randomly generated sequences should equal that of s.

  11. public static char translateCodon(String s)
    Assuming the input sequence s is a genomic sequence of exactly 3 nucleotides, the method translates the codon, i.e., it returns a character representing the amino acid corresponding to the codon. For example, if the input sequence s is CUA then the method should return the character L corresponding to the amino acid Leucine. A table of codon translations is available in HTML format and in tab-delimited text format. Note: the expected running time of this method should not be linear in the size of the table, i.e., the method should not compare s to each of the 64 codons. Instead, the expected running time of the method should be constant. Consider what data structures would be appropriate to store the table in order to support a constant expected running time of this method.

  12. public static String translateORF(String s)
    Assuming the length of the input sequence s is divisible by 3, the method returns a translated version of the sequence, i.e., every 3 nucleotides in s are translated into an amino acid, and the sequence of amino acids is returned. For example, if s represents the sequence AUGGCCUUUCGAUAG then the method should return the string MAFR*.