How to turn in this Project
You are required to turn in both a hardcopy and a softcopy. If working as a team, the team should submit a single hardcopy and a single softcopy (to either team member's account). Please make sure to keep a copy of your work, on your computer, in your private directory (or, to play it safe, both).Hardcopy Submission
Your hardcopy packet should consist of:- The cover page;
- Your
EM_MotifSearch.java
file from Task 1. - Your
Gibbs_MotifSearch.java
file from Task 2.
Softcopy Submission
You should submit your final version of your
MotifFinding
directory to the
drop/project6
directory in your
account on the CS server.
You feel refreshed by your post-graduation meandering. It sure was soothing to be off the grid. I mean, there were times on your travels that you didn't check your messages for hours at a time. But as your phone dings to alert you to a new text message, you are acutely aware that you are back in civilization. The message is from a multinational, pharmaceutical conglomerate. Apparently, they bought out the local biotech company that you worked for last summer. The company wants to hire you to join their computational biology group. Recently, the pharmaceutical behemoth has invested heavily in next-generation DNA sequencing technology and they want you to design a computational infrastructure to support analyses of the large data sets generated from their high-throughput sequencing experiments. You are impressed by the substantial financial resources that the company is devoting to this project and by the company's interest in designing affordable drugs for people in need. Also, it is the only job offer you have. You decide to sign up.
Background
In this project, your goal is to implement two algorithms for characterizing motifs in genomic sequences: Expectation Maximization and Gibbs sampling. We have provided you with one class that you can use:- An instance of the
MotifSearch
class represents a search for a motif common to a set of genomic sequences.
The contract for this class can be found here.
An implementation for this class is stored in the
/home/cs313/download/MotifFinding
subdirectory on
the CS server. To begin, download this directory.
The abovementioned class does not require any modification. Your goal is to
implement two new classes from scratch, EM_MotifSearch
and
Gibbs_MotifSearch
, as described below.
Task 1: Expectation Maximization (EM)
We have provided you with a MotifSearch
application that, when
executed, reads in a set of genomic sequences from a FASTA file.
For example, the provided file
data/modA.txt
contains genomic sequences from 9 organisms.
The 9 genomic sequences were obtained by extracting the upstream
sequences of 9 orthologous genes.
When the MotifSearch
application is invoked as follows
java MotifSearch data/modA.txt 16
then the program will read in from the specified file the 9 genomic sequences.
The second command line argument, 16, in this example indicates the length
of the motif that we are searching for.
The MotifSearch
application
creates an empty motif search, i.e., it does not search for a motif.
You do not need to modify the MotifSearch
class.
In this task, your goal is to create a class, EM_MotifSearch
,
that inherits from the MotifSearch
class and implements
an Expectation Maximization algorithm for identifying motifs.
The contract for the
EM_MotifSearch
class that you are asked to implement
can be found here.
To begin, study the contract of the provided class: MotifSearch
.
This class contains many
methods that will be useful when implementing your EM
algorithm. After studying this class, you should create a new
class, EM_MotifSearch
, that extends the
MotifSearch
class and fulfills the
EM_MotifSearch
contract.
With the code available to download on the CS server, we have provided
you with two files for testing your EM implementation:
modA.txt
and Exercise7_example.txt
.
These two data sets were used and described in
Exercise 7.
We have also
provided you with a sample solution, Test_EM
,
which you can execute and compare to your own solution.
Task 2: Gibbs Sampling
In this task, your goal is to create a class, Gibbs_MotifSearch
,
that inherits from the EM_MotifSearch
class
that you implemented in Task 1. The Gibbs_MotifSearch
class should implement a Gibbs sampling algorithm for identifying
motifs. The contract for the
Gibbs_MotifSearch
class that you are asked to implement
can be found here.
To begin, study the contracts of the MotifSearch
and EM_MotifSearch
classes.
These classes contains many
methods that will be useful when implementing your Gibbs sampling
algorithm. After studying these classes, you should create a new
class, Gibbs_MotifSearch
, that extends the
EM_MotifSearch
class and fulfills the
Gibbs_MotifSearch
contract.
With the code available to download on the CS server, we have provided
you with two files for testing your Gibbs sampling implementation:
modA.txt
and Exercise7_example.txt
.
These two data sets were used and described in
Exercise 7.
We have also
provided you with a sample solution, Test_Gibbs
,
which you can execute and compare to your own solution.