Class EM_MotifSearch

java.lang.Object
  extended by MotifSearch
      extended by EM_MotifSearch
Direct Known Subclasses:
Gibbs_MotifSearch

public class EM_MotifSearch
extends MotifSearch

An instance of the EM_MotifSearch class represents a search using the EM algorithm for a motif common to a set of genomic sequences.


Field Summary
 
Fields inherited from class MotifSearch
instanceLocations, matrix, motifLength, numSequences, sequences
 
Constructor Summary
EM_MotifSearch(java.lang.String fileName, int motifLength)
          Creates an initially empty EM_MotifSearch.
 
Method Summary
 void determineMatrixModel()
          The Maximization step in the EM algorithm.
 void determineMotifInstances()
          The Expectation step in the EM algorithm.
 void EM()
          Executes the EM (Expectation Maximization) algorithm.
 double getInformationContentOfMatrix()
          Returns the information content associated with the matrix model.
 double getScoreForMotifInstance(java.lang.String s)
          Given a candidate motif instance, returns the score (probability) of that instance based on the matrix model.
static void main(java.lang.String[] args)
          The main method generates an EM_MotifSearch for a motif of the specified length in the genomic sequences found in the specified FASTA file.
 void run_EM_multiple_times(int iterations)
          Executes the EM (Expectation Maximization) algorithm multiple times.
 void setRandomLocationsForMotifInstances()
          The initial random seed step in the EM algorithm.
 
Methods inherited from class MotifSearch
addPseudocountsToMatrix, getConsensusSequence, getInstanceLocations, getMatrix, getNucleotideContent, matrixToString, motifInstancesToString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

EM_MotifSearch

public EM_MotifSearch(java.lang.String fileName,
                      int motifLength)
Creates an initially empty EM_MotifSearch.

A set of genomic sequences is determined from the specified String representing the name of a file. Genomic sequences are read-in from the FASTA file. The integer parameter represents the desired length of the motif being searched for. Initially, the constructed EM_MotifSearch is empty. This constructor need only invoke the constructor of the super class.

Parameters:
fileName - the name of a FASTA file containing one or more genomic sequences
motifLength - the length of the desired motif
Method Detail

setRandomLocationsForMotifInstances

public void setRandomLocationsForMotifInstances()
The initial random seed step in the EM algorithm.

For each sequence, randomly determine the start index of a motif instance in the sequence.


determineMatrixModel

public void determineMatrixModel()
The Maximization step in the EM algorithm.

Based on the motif instances in the sequences, creates a matrix motif model. The resulting matrix model should be updated with pseudocounts so that no entries in the matrix correspond to 0.0.


determineMotifInstances

public void determineMotifInstances()
The Expectation step in the EM algorithm.

Based on the matrix model, identifies motif instances in the sequences. One motif instance is identified in each sequence. For each sequence, the motif instance that best matches the model is chosen.


getScoreForMotifInstance

public double getScoreForMotifInstance(java.lang.String s)
Given a candidate motif instance, returns the score (probability) of that instance based on the matrix model.

The length of the motif instance specified by String s must be the same as the number of columns in the matrix model.

Parameters:
s - a String corresponding to a motif instance
Returns:
the score (probability) of the motif instance matching the matrix model

getInformationContentOfMatrix

public double getInformationContentOfMatrix()
Returns the information content associated with the matrix model.

The information content of a matrix model is described in Task 2 of Exercise 7. The method getNucleotideContent may be useful in determining the background frequency of different nucleotides.

Returns:
the information content of the matrix model

EM

public void EM()
Executes the EM (Expectation Maximization) algorithm.

Initially, the EM algorithm is randomly seeded, i.e., one motif instance is randomly chosen in each sequence. Then, the Maximization and Expectation steps are alternately repeated until convergence. The EM algorithm converges when the information content of the matrix model no longer improves.


run_EM_multiple_times

public void run_EM_multiple_times(int iterations)
Executes the EM (Expectation Maximization) algorithm multiple times.

The number of times that the algorithm is executed is specified by the integer parameter. The best motif, as determined by information content, is identified over all executions of the algorithm. Upon completion of this method, this EM_MotifSearch should correspond to the best motif (including matrix model and motif instances) identified over all executions of the algorithm.


main

public static void main(java.lang.String[] args)
The main method generates an EM_MotifSearch for a motif of the specified length in the genomic sequences found in the specified FASTA file. The EM algorithm is executed the specified number of iterations. The set of motif instances, matrix, consensus sequence, and information content corresponding to the maximum over all iterations are output.

Parameters:
args - an array of Strings representing any command line arguments