CS313

Project 6

How to turn in this Project

You are required to turn in both a hardcopy and a softcopy. If working as a team, the team should submit a single hardcopy and a single softcopy (to either team member's account). Please make sure to keep a copy of your work, on your computer, in your private directory (or, to play it safe, both).

Hardcopy Submission

Your hardcopy packet should consist of:
  1. The cover page;
  2. Your EM_MotifSearch.java file from Task 1.
  3. Your Gibbs_MotifSearch.java file from Task 2.
Staple these together and submit them on the due date.

Softcopy Submission

You should submit your final version of your MotifFinding directory to the drop/project6 directory in your account on the CS server.


You feel refreshed by your post-graduation meandering. It sure was soothing to be off the grid. I mean, there were times on your travels that you didn't check your messages for hours at a time. But as your phone dings to alert you to a new text message, you are acutely aware that you are back in civilization. The message is from a multinational, pharmaceutical conglomerate. Apparently, they bought out the local biotech company that you worked for last summer. The company wants to hire you to join their computational biology group. Recently, the pharmaceutical behemoth has invested heavily in next-generation DNA sequencing technology and they want you to design a computational infrastructure to support analyses of the large data sets generated from their high-throughput sequencing experiments. You are impressed by the substantial financial resources that the company is devoting to this project and by the company's interest in designing affordable drugs for people in need. Also, it is the only job offer you have. You decide to sign up.


Background

In this project, your goal is to implement two algorithms for characterizing motifs in genomic sequences: Expectation Maximization and Gibbs sampling. We have provided you with one class that you can use:

The contract for this class can be found here.

An implementation for this class is stored in the /home/cs313/download/MotifFinding subdirectory on the CS server. To begin, download this directory. The abovementioned class does not require any modification. Your goal is to implement two new classes from scratch, EM_MotifSearch and Gibbs_MotifSearch, as described below.


Task 1: Expectation Maximization (EM)

We have provided you with a MotifSearch application that, when executed, reads in a set of genomic sequences from a FASTA file. For example, the provided file data/modA.txt contains genomic sequences from 9 organisms. The 9 genomic sequences were obtained by extracting the upstream sequences of 9 orthologous genes. When the MotifSearch application is invoked as follows

java MotifSearch data/modA.txt 16

then the program will read in from the specified file the 9 genomic sequences. The second command line argument, 16, in this example indicates the length of the motif that we are searching for. The MotifSearch application creates an empty motif search, i.e., it does not search for a motif. You do not need to modify the MotifSearch class. In this task, your goal is to create a class, EM_MotifSearch, that inherits from the MotifSearch class and implements an Expectation Maximization algorithm for identifying motifs. The contract for the EM_MotifSearch class that you are asked to implement can be found here.

To begin, study the contract of the provided class: MotifSearch. This class contains many methods that will be useful when implementing your EM algorithm. After studying this class, you should create a new class, EM_MotifSearch, that extends the MotifSearch class and fulfills the EM_MotifSearch contract.

With the code available to download on the CS server, we have provided you with two files for testing your EM implementation: modA.txt and Exercise7_example.txt. These two data sets were used and described in Exercise 7. We have also provided you with a sample solution, Test_EM, which you can execute and compare to your own solution.


Task 2: Gibbs Sampling

In this task, your goal is to create a class, Gibbs_MotifSearch, that inherits from the EM_MotifSearch class that you implemented in Task 1. The Gibbs_MotifSearch class should implement a Gibbs sampling algorithm for identifying motifs. The contract for the Gibbs_MotifSearch class that you are asked to implement can be found here.

To begin, study the contracts of the MotifSearch and EM_MotifSearch classes. These classes contains many methods that will be useful when implementing your Gibbs sampling algorithm. After studying these classes, you should create a new class, Gibbs_MotifSearch, that extends the EM_MotifSearch class and fulfills the Gibbs_MotifSearch contract.

With the code available to download on the CS server, we have provided you with two files for testing your Gibbs sampling implementation: modA.txt and Exercise7_example.txt. These two data sets were used and described in Exercise 7. We have also provided you with a sample solution, Test_Gibbs, which you can execute and compare to your own solution.