Class Gibbs_MotifSearch

java.lang.Object
  extended by MotifSearch
      extended by EM_MotifSearch
          extended by Gibbs_MotifSearch

public class Gibbs_MotifSearch
extends EM_MotifSearch

An instance of the Gibbs_MotifSearch class represents a search using the Gibbs sampling algorithm for a motif common to a set of genomic sequences.


Field Summary
 
Fields inherited from class MotifSearch
instanceLocations, matrix, motifLength, numSequences, sequences
 
Constructor Summary
Gibbs_MotifSearch(java.lang.String fileName, int motifLength)
          Creates an initially empty Gibbs_MotifSearch.
 
Method Summary
 void determineMotifInstances()
          The sampling counterpart of the Expectation step in the EM algorithm.
 int getIndexViaSampling(java.util.Vector<java.lang.Double> values)
          Returns the index of a randomly sampled value in a Vector.
static void main(java.lang.String[] args)
          The main method generates a Gibbs_MotifSearch for a motif of the specified length in the genomic sequences found in the specified FASTA file.
 
Methods inherited from class EM_MotifSearch
determineMatrixModel, EM, getInformationContentOfMatrix, getScoreForMotifInstance, run_EM_multiple_times, setRandomLocationsForMotifInstances
 
Methods inherited from class MotifSearch
addPseudocountsToMatrix, getConsensusSequence, getInstanceLocations, getMatrix, getNucleotideContent, matrixToString, motifInstancesToString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Gibbs_MotifSearch

public Gibbs_MotifSearch(java.lang.String fileName,
                         int motifLength)
Creates an initially empty Gibbs_MotifSearch.

The genomic sequences are read in from the FASTA file named by the specified String. The integer parameter specifies the desired length of the motif being searched for. Initially, the constructed Gibbs_MotifSearch is empty. This constructor need only invoke the constructor of the superclass.

Parameters:
fileName - the name of a FASTA file containing one or more genomic sequences
motifLength - the length of the desired motif

Method Detail

determineMotifInstances

public void determineMotifInstances()
The sampling counterpart of the Expectation step in the EM algorithm.

Based on the matrix model, identifies motif instances in the sequences. Exactly one motif instance is identified in each sequence. For each sequence, every possible motif instance in that sequence is scored against the matrix model, and one instance is then chosen by sampling from those scores, so higher-scoring instances are more likely to be chosen but are not guaranteed to be.
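
The per-sequence sampling described above might look roughly like the following sketch. It is not the class's actual implementation. The class name GibbsStepSketch, the method names, the A, C, G, T row order of the matrix, and the scoring rule (product of per-column matrix probabilities) are all assumptions made for illustration. The real class presumably scores instances with the inherited getScoreForMotifInstance, whose definition is not given here.

```java
import java.util.Random;

final class GibbsStepSketch {

    // Assumed row order of the matrix: A, C, G, T.
    static int rowOf(char nucleotide) {
        switch (Character.toUpperCase(nucleotide)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:
                throw new IllegalArgumentException("unexpected nucleotide: " + nucleotide);
        }
    }

    // Assumed scoring rule: product of the matrix probabilities of the
    // nucleotides in the candidate instance starting at 'start'.
    static double score(String sequence, int start, double[][] matrix) {
        double s = 1.0;
        for (int col = 0; col < matrix[0].length; col++) {
            s *= matrix[rowOf(sequence.charAt(start + col))][col];
        }
        return s;
    }

    // For each sequence, scores every candidate start position and samples
    // one position with probability proportional to its score.
    static int[] sampleInstanceLocations(String[] sequences, double[][] matrix, Random rng) {
        int motifLength = matrix[0].length;
        int[] locations = new int[sequences.length];
        for (int s = 0; s < sequences.length; s++) {
            int candidates = sequences[s].length() - motifLength + 1;
            if (candidates <= 0) {
                throw new IllegalArgumentException("sequence shorter than motif");
            }
            double[] scores = new double[candidates];
            double total = 0.0;
            for (int start = 0; start < candidates; start++) {
                scores[start] = score(sequences[s], start, matrix);
                total += scores[start];
            }
            if (!(total > 0.0)) {
                throw new IllegalStateException("all candidate scores are zero");
            }
            // Scale a uniform draw from [0, 1) to [0, total) and walk the running sum.
            double r = rng.nextDouble() * total;
            double running = 0.0;
            int chosen = candidates - 1; // fallback guards against rounding error
            for (int start = 0; start < candidates; start++) {
                running += scores[start];
                if (running > r) {
                    chosen = start;
                    break;
                }
            }
            locations[s] = chosen;
        }
        return locations;
    }
}
```

Unlike a deterministic step that always takes the highest-scoring instance, this step occasionally picks a lower-scoring one, which is what lets the search move away from a poor starting configuration.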

Overrides:
determineMotifInstances in class EM_MotifSearch

getIndexViaSampling

public int getIndexViaSampling(java.util.Vector<java.lang.Double> values)
Returns the index of a randomly sampled value in a Vector.

One value from the Vector is chosen at random and the value's index (not the value itself) is returned. The value is not chosen uniformly at random, but rather via sampling: each value is chosen with probability proportional to its magnitude, so higher values are more likely to be chosen and lower values are less likely to be chosen.

One approach for randomly sampling a collection of values proceeds as follows:

  1. Normalize the values in the collection so that they sum to 1.0. The values now represent a probability distribution.
  2. Convert the values from a probability distribution to a cumulative distribution. In a cumulative distribution, the value at index i represents the sum of all values at indices less than or equal to i in the probability distribution. The final value in a cumulative distribution should be 1.0 since the sum of all values in a probability distribution is 1.0.
  3. Generate a number uniformly at random between 0.0 and 1.0. Return the index of the smallest value in the cumulative distribution that is at least as big as the random number.
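
The three steps above can be sketched as follows. This is a minimal illustration, not the class's own code. The class name SamplingSketch and the extra Random parameter (added so that results can be reproduced) are assumptions. The documented method takes only the Vector. The sketch also assumes the values are non-negative and have a positive sum.

```java
import java.util.Random;
import java.util.Vector;

final class SamplingSketch {

    // Returns an index sampled with probability proportional to the value
    // stored at that index. The Random parameter is a hypothetical addition.
    static int getIndexViaSampling(Vector<Double> values, Random rng) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double total = 0.0;
        for (double v : values) {
            if (v < 0.0) {
                throw new IllegalArgumentException("values must be non-negative");
            }
            total += v;
        }
        if (!(total > 0.0)) {
            throw new IllegalArgumentException("values must have a positive sum");
        }

        // Steps 1 and 2: normalize each value and accumulate the running sum,
        // giving the cumulative distribution.
        double[] cumulative = new double[values.size()];
        double running = 0.0;
        for (int i = 0; i < values.size(); i++) {
            running += values.get(i) / total;
            cumulative[i] = running;
        }

        // Step 3: draw uniformly from [0.0, 1.0) and return the first index
        // whose cumulative value is at least as big as the draw.
        double r = rng.nextDouble();
        for (int i = 0; i < cumulative.length; i++) {
            if (cumulative[i] >= r) {
                return i;
            }
        }
        // Rounding can leave the last cumulative value slightly below 1.0.
        return cumulative.length - 1;
    }
}
```

With the values 1.0 and 3.0, for example, index 1 should be returned about three times as often as index 0 over many calls.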

Parameters:
values - a Vector of decimal numbers to be sampled
Returns:
the index of a randomly sampled value from the Vector

main

public static void main(java.lang.String[] args)
The main method generates a Gibbs_MotifSearch for a motif of the specified length in the genomic sequences found in the specified FASTA file. The Gibbs sampling algorithm is executed for the specified number of iterations. The set of motif instances, matrix, consensus sequence, and information content corresponding to the maximum over all iterations are output.

Parameters:
args - an array of Strings representing any command line arguments