public class Alignment
extends java.lang.Object
Constructor | Description |
---|---|
Alignment(java.io.File file1, java.io.File file2) | Creates an Alignment from the genomic sequences found in the two specified FASTA files. |
Alignment(java.lang.String sequence1, java.lang.String sequence2) | Creates an Alignment from the two specified genomic sequences. |

Modifier and Type | Method | Description |
---|---|---|
java.lang.String | alignmentTableToString() | Returns a String representation of the alignment table generated during alignment computation. |
java.lang.String | backtrackTableToString() | Returns a String representation of the backtrack table generated during alignment computation. |
void | calculatePValue() | Estimates the p-value of this Alignment. |
void | computeAlignment() | Computes the optimal pairwise alignment of two genomic sequences. |
java.lang.String | getAlignment() | Returns the optimal pairwise alignment. |
int | getAlignmentScore() | Returns the optimal pairwise alignment score for this Alignment. |
double | getPValue() | Returns the p-value of this Alignment. |
static void | main(java.lang.String[] args) | The main method creates an optimal pairwise alignment for two genomic sequences. |
static void | outputHistogramOfRandomAlignmentScores(java.lang.String fileName, java.util.Vector<java.lang.Integer> v) | Outputs a histogram of optimal pairwise alignment scores to a file. |
java.lang.String | sequence1() | Returns the first of two genomic sequences in this Alignment. |
java.lang.String | sequence2() | Returns the second of two genomic sequences in this Alignment. |
void | setAffineGaps(int alphaGapScore, int betaGapScore) | Scores gaps in this Alignment using an affine model. |
void | setFastAlignment(int numGaps) | Indicates that a fast linear-time pairwise alignment should be performed. |
void | setFixedScoring(int matchScore, int mismatchScore) | When aligning two characters, one from each genomic sequence, if the two characters are identical then the alignment score of the two characters should be the match score. |
void | setGlobalAlignment() | Indicates that a global pairwise alignment should be performed. |
void | setLinearGaps(int linearGapScore) | Scores gaps in this Alignment using a linear model. |
void | setLocalAlignment() | Indicates that a local pairwise alignment should be performed. |
void | setMatrixScoring(java.lang.String fileName) | When aligning two characters, one from each genomic sequence, the alignment score of the two characters should be determined from a matrix of scores found in a file with the specified name. |
java.lang.String | toString() | Returns a String representation of this Alignment. |

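public Alignment(java.io.File file1, java.io.File file2)
Creates an Alignment from the genomic sequences found in the two specified FASTA files.
file1 - a File object referring to a FASTA file containing a genomic sequence
file2 - a File object referring to a FASTA file containing a genomic sequence

public Alignment(java.lang.String sequence1, java.lang.String sequence2)
Creates an Alignment from the two specified genomic sequences.
sequence1 - a genomic sequence to be aligned
sequence2 - a genomic sequence to be aligned

public java.lang.String sequence1()
Returns the first of two genomic sequences in this Alignment.
Returns: the String corresponding to the first of two genomic sequences in the pairwise alignment

public java.lang.String sequence2()
Returns the second of two genomic sequences in this Alignment.
Returns: the String corresponding to the second of two genomic sequences in the pairwise alignment

public void computeAlignment()
Computes the optimal pairwise alignment of two genomic sequences.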
Either the optimal global or optimal local pairwise alignment is computed. Computation of the optimal pairwise alignment includes determining the optimal pairwise alignment score as well as the corresponding alignment.
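To make the idea of an optimal global alignment score concrete, here is a minimal sketch of the classic dynamic-programming (Needleman-Wunsch style) table fill with fixed match/mismatch scores and a linear gap score. The class and method names (`GlobalAlignmentSketch`, `globalScore`) are hypothetical and are not part of the documented `Alignment` class; the documentation does not state how `computeAlignment()` is implemented, so this is an illustration only.

```java
class GlobalAlignmentSketch {

    // Hypothetical helper (not part of the Alignment API): returns the optimal
    // global alignment score of s1 and s2 under fixed scoring and linear gaps.
    static int globalScore(String s1, String s2, int match, int mismatch, int gap) {
        int n = s1.length();
        int m = s2.length();
        int[][] table = new int[n + 1][m + 1];

        // Aligning a prefix against the empty string costs one gap per character.
        for (int i = 1; i <= n; i++) table[i][0] = i * gap;
        for (int j = 1; j <= m; j++) table[0][j] = j * gap;

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int pair = s1.charAt(i - 1) == s2.charAt(j - 1) ? match : mismatch;
                int diagonal = table[i - 1][j - 1] + pair; // align the two characters
                int up = table[i - 1][j] + gap;            // gap in s2
                int left = table[i][j - 1] + gap;          // gap in s1
                table[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return table[n][m];
    }

    public static void main(String[] args) {
        System.out.println(globalScore("ACGT", "AGT", 1, -1, -2));
    }
}
```

A local alignment can be sketched in the same way by additionally clamping each cell at zero and taking the maximum over the whole table rather than the bottom-right cell.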

public int getAlignmentScore()
Returns the optimal pairwise alignment score for this Alignment.

public java.lang.String alignmentTableToString()
Returns a String representation of the alignment table generated during alignment computation. This method is used primarily for debugging and is only useful for small alignment tables, i.e., when aligning short sequences.
Returns: a String representation of the alignment table

public java.lang.String backtrackTableToString()
Returns a String representation of the backtrack table generated during alignment computation. This method is used primarily for debugging and is only useful for small backtrack tables, i.e., when aligning short sequences.
Returns: a String representation of the backtrack table

public java.lang.String getAlignment()
Returns the optimal pairwise alignment.
Returns: a String representing the optimal pairwise alignment

public double getPValue()
Returns the p-value of this Alignment.
The p-value of this Alignment is the probability (between 0.0 and 1.0) that the optimal pairwise alignment score of two random sequences is greater than or equal to the optimal pairwise alignment score for this Alignment. A p-value close to 1.0 suggests that an alignment was likely to have occurred merely by chance. A p-value close to 0.0 suggests that an alignment was unlikely to have occurred by chance. In this case (especially if the p-value is less than about 0.05), the alignment is considered significant and the sequences are deemed similar.
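
public void setGlobalAlignment()
Indicates that a global pairwise alignment should be performed.

public void setLocalAlignment()
Indicates that a local pairwise alignment should be performed.

public void setFastAlignment(int numGaps)
Indicates that a fast linear-time pairwise alignment should be performed.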
For two sequences of length n, rather than use an O(n^2) algorithm that identifies the optimal alignment with any number of gaps, a FAST alignment runs in O(numGaps*n) time where numGaps is the number of gaps considered. This option is only available for global alignments.
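One common way to obtain a running time proportional to numGaps*n is a banded dynamic program that only fills table cells within a fixed distance of the main diagonal. The sketch below illustrates that technique; it is an assumption that the fast alignment works along these lines, since the documentation does not say. The names `BandedAlignmentSketch`, `bandedGlobalScore`, and the `band` parameter (playing the role of `numGaps`) are hypothetical.

```java
import java.util.Arrays;

class BandedAlignmentSketch {

    // Sentinel for cells outside the band; halved so adding a score cannot overflow.
    static final int NEG_INF = Integer.MIN_VALUE / 2;

    // Hypothetical helper: global alignment score restricted to cells with |i - j| <= band.
    // Uses two rows of width 2*band+1, so time is O(band * n) and memory is O(band).
    static int bandedGlobalScore(String s1, String s2, int match, int mismatch,
                                 int gap, int band) {
        int n = s1.length();
        int m = s2.length();
        if (band < 0 || Math.abs(n - m) > band) {
            throw new IllegalArgumentException("band too narrow for these lengths");
        }
        int width = 2 * band + 1;
        int[] prev = new int[width];
        int[] cur = new int[width];

        // Row 0: cell (0, j) is stored at index j - 0 + band.
        Arrays.fill(prev, NEG_INF);
        for (int j = 0; j <= Math.min(m, band); j++) {
            prev[j + band] = j * gap;
        }

        for (int i = 1; i <= n; i++) {
            Arrays.fill(cur, NEG_INF);
            int lo = Math.max(0, i - band);
            int hi = Math.min(m, i + band);
            for (int j = lo; j <= hi; j++) {
                int d = j - i + band; // index of cell (i, j) within its row
                int best;
                if (j == 0) {
                    best = i * gap; // first column: only gaps
                } else {
                    int pair = s1.charAt(i - 1) == s2.charAt(j - 1) ? match : mismatch;
                    best = prev[d] + pair;                                       // diagonal (i-1, j-1)
                    if (d + 1 < width) best = Math.max(best, prev[d + 1] + gap); // up (i-1, j)
                    if (d - 1 >= 0) best = Math.max(best, cur[d - 1] + gap);     // left (i, j-1)
                }
                cur[d] = best;
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[m - n + band];
    }

    public static void main(String[] args) {
        System.out.println(bandedGlobalScore("ACGT", "AGT", 1, -1, -2, 1));
    }
}
```

A band that is too narrow can miss the true optimum, which is the usual trade-off of this approach: the result is optimal only among alignments that stay inside the band.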
numGaps - at least this many gaps are considered when computing the optimal alignment

public void setLinearGaps(int linearGapScore)
Scores gaps in this Alignment using a linear model.
With a linear model for scoring gaps, every gap is penalized the same amount, as specified by the linearGapScore parameter.
linearGapScore - the (negative) contribution to the alignment score of each gap

public void setAffineGaps(int alphaGapScore, int betaGapScore)
Scores gaps in this Alignment using an affine model.
With an affine model for scoring gaps, the first gap in a sequence of consecutive gaps is penalized by the alphaGapScore parameter, whereas subsequent gaps in a sequence of consecutive gaps are penalized by the betaGapScore parameter.
Affine gap scoring is meant to model empirical biological evidence that the existence of a gap is more significant than the length of the gap. It is expensive, biologically, to add to or splice from a genomic sequence, but the length of the addition or deletion is less important.
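The difference between the two gap models can be seen by computing the total contribution of a run of consecutive gaps. This is a small sketch based only on the descriptions above; the class and method names are hypothetical, and the example scores are made up for illustration.

```java
class GapPenaltySketch {

    // Linear model: every gap in the run contributes the same (negative) score.
    static int linearPenalty(int runLength, int linearGapScore) {
        return runLength * linearGapScore;
    }

    // Affine model: the first gap contributes alphaGapScore,
    // each subsequent gap in the same run contributes betaGapScore.
    static int affinePenalty(int runLength, int alphaGapScore, int betaGapScore) {
        if (runLength <= 0) {
            return 0; // no gaps, no penalty
        }
        return alphaGapScore + (runLength - 1) * betaGapScore;
    }

    public static void main(String[] args) {
        // Illustrative scores only: opening a run costs 5, extending it costs 1.
        for (int k = 1; k <= 4; k++) {
            System.out.println(k + " gaps: linear=" + linearPenalty(k, -2)
                    + " affine=" + affinePenalty(k, -5, -1));
        }
    }
}
```

With these example scores a single gap is more expensive under the affine model, while a long run quickly becomes cheaper than under the linear model, which matches the stated biological motivation.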
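alphaGapScore - the (negative) contribution to the alignment score of initiating each sequence of gaps
betaGapScore - the (negative) contribution to the alignment score of extending each sequence of gaps

public void setFixedScoring(int matchScore, int mismatchScore)
When aligning two characters, one from each genomic sequence, if the two characters are identical then the alignment score of the two characters should be the match score. If the two characters differ then the alignment score of the two characters should be the mismatch score.
matchScore - the (positive) contribution to the alignment score of aligning two identical characters
mismatchScore - the (negative) contribution to the alignment score of aligning two different characters

public void setMatrixScoring(java.lang.String fileName)
When aligning two characters, one from each genomic sequence, the alignment score of the two characters should be determined from a matrix of scores found in a file with the specified name.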
In fixed scoring, all pairs of identical characters (e.g., G|G, C|C, T|T) are scored the same and all pairs of different characters (e.g., G|C, G|T, C|T) are scored the same. However, with genomic sequences, not all pairs of characters are equally similar or dissimilar. In matrix scoring, different pairs of identical characters (e.g., G|G, C|C, T|T) may be scored differently and different pairs of mismatching characters (e.g., G|C, G|T, C|T) may be scored differently. For example, adenine (A) and guanine (G) are both purines whereas cytosine (C) and thymine (T) are both pyrimidines. Since adenine is more similar to guanine than to thymine, an adenine aligned with a guanine (A|G) should not penalize an alignment as much as an adenine aligned with a thymine (A|T). Analogously for protein sequences, two different hydrophobic amino acids aligned together might not penalize an alignment as much as a hydrophobic amino acid aligned with a hydrophilic amino acid.
In matrix scoring, the alignment score of every possible pair of characters is specified in a matrix that must be read in from a file. For DNA sequences, since there are 4 characters in the DNA alphabet, there are 16 possible pairs of characters and the matrix contains 16 entries. For protein sequences, since there are 20 characters in the protein alphabet, there are 400 possible pairs of characters and the matrix contains 400 entries.
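As an illustration of matrix scoring for DNA, the sketch below holds a 16-entry matrix in memory and looks up the score of a character pair. The score values are invented for the example (purine-purine and pyrimidine-pyrimidine mismatches are penalized less than the others), and the class and method names are hypothetical. The layout of the matrix file expected by `setMatrixScoring` is not described here, so no file parsing is shown.

```java
class MatrixScoringSketch {

    // Row and column order of the matrix below.
    static final String ALPHABET = "ACGT";

    // Illustrative values only: 2 for a match, -1 for A|G and C|T
    // (same chemical class), -2 for every other mismatch.
    static final int[][] SCORES = {
        //  A   C   G   T
        {   2, -2, -1, -2 }, // A
        {  -2,  2, -2, -1 }, // C
        {  -1, -2,  2, -2 }, // G
        {  -2, -1, -2,  2 }  // T
    };

    // Looks up the alignment score of one character pair.
    static int score(char a, char b) {
        int i = ALPHABET.indexOf(a);
        int j = ALPHABET.indexOf(b);
        if (i < 0 || j < 0) {
            throw new IllegalArgumentException("unknown character: " + a + " or " + b);
        }
        return SCORES[i][j];
    }

    public static void main(String[] args) {
        System.out.println("A|G scores " + score('A', 'G'));
        System.out.println("A|T scores " + score('A', 'T'));
    }
}
```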
fileName - the name of a file containing a matrix of alignment scores for all pairs of characters

public void calculatePValue()
Estimates the p-value of this Alignment.
The p-value of this Alignment is the probability (between 0.0 and 1.0) that the optimal pairwise alignment score of two random sequences is greater than or equal to the optimal pairwise alignment score for this Alignment. A p-value close to 1.0 suggests that an alignment was likely to have occurred merely by chance. A p-value close to 0.0 suggests that an alignment was unlikely to have occurred by chance. In this case (especially if the p-value is less than about 0.05), the alignment is considered significant and the sequences are deemed similar.
A p-value for an alignment of two sequences can be estimated as follows. Generate 1000 pairs of random sequences by randomly permuting the original two sequences. For each of the 1000 pairs, determine the optimal pairwise alignment score. These 1000 scores approximate an extreme value distribution. Calculate the mean and standard_deviation of the 1000 scores; these determine the two parameters, beta and mu, of the extreme value distribution. beta can be calculated as standard_deviation * √6 / π, and mu can then be calculated as mean - (0.5772 * beta), where 0.5772 approximates the Euler-Mascheroni constant. Finally, the p-value can be calculated as 1.0 - e^(-e^(-(x - mu) / beta)), where x is the optimal pairwise alignment score of the original pair of sequences.
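The statistical part of this procedure can be sketched as follows, assuming the scores of the randomly permuted pairs have already been collected in an array. The class and method names are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the description does not specify which is meant.

```java
class PValueSketch {

    // Tail probability of an extreme value distribution with parameters mu and beta:
    // the probability that a random score is greater than or equal to x.
    static double evdTail(double mu, double beta, double x) {
        return 1.0 - Math.exp(-Math.exp(-(x - mu) / beta));
    }

    // Estimates the p-value of an observed score from the scores of random pairs.
    static double pValue(int[] randomScores, int observedScore) {
        if (randomScores.length == 0) {
            throw new IllegalArgumentException("no random scores");
        }
        double sum = 0.0;
        for (int s : randomScores) sum += s;
        double mean = sum / randomScores.length;

        double squaredDeviations = 0.0;
        for (int s : randomScores) squaredDeviations += (s - mean) * (s - mean);
        // Assumption: population standard deviation (divide by N).
        double standardDeviation = Math.sqrt(squaredDeviations / randomScores.length);
        if (standardDeviation == 0.0) {
            throw new IllegalArgumentException("all random scores are identical");
        }

        double beta = standardDeviation * Math.sqrt(6.0) / Math.PI;
        double mu = mean - 0.5772 * beta;
        return evdTail(mu, beta, observedScore);
    }

    public static void main(String[] args) {
        int[] randomScores = {10, 12, 14, 16, 18}; // made-up scores for illustration
        System.out.println(pValue(randomScores, 25));
    }
}
```

In practice the array would hold the 1000 scores described above; five made-up values are used here only to keep the example short.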

public java.lang.String toString()
Returns a String representation of this Alignment.
Overrides: toString in class java.lang.Object
Returns: a String representation of this Alignment

public static void outputHistogramOfRandomAlignmentScores(java.lang.String fileName, java.util.Vector<java.lang.Integer> v)
Outputs a histogram of optimal pairwise alignment scores to a file.
A histogram with 101 bins is created from the set of alignment scores stored in the Vector v. The histogram indicates the number of alignment scores corresponding to each of the 101 bins. The histogram is output to the specified file fileName. The first column of the output file represents the x-axis of the histogram, i.e., the 101 bins corresponding to 101 possible alignment scores. The second column of the output file indicates the number of alignment scores corresponding to each bin. The third column is a normalized version of the second column, i.e., each entry in the third column is the corresponding entry in the second column divided by the total number of alignment scores. The fourth column represents a mathematical function, an extreme value distribution, that approximates the third column. The extreme value distribution is determined from the mean and standard deviation of the set of alignment scores in v.
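The four columns described above could be produced along the following lines. This is a sketch under several assumptions that the documentation does not confirm: the 101 bins are taken to be the integer scores 0 through 100, columns are tab-separated, the fourth column is the density of the extreme value (Gumbel) distribution evaluated at each bin, and the population standard deviation is used. The class and method names are hypothetical, and the rows are returned as strings rather than written to a file to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class HistogramSketch {

    static final int BINS = 101; // assumption: one bin per integer score 0..100

    // Builds one tab-separated row per bin: bin, count, normalized count, EVD density.
    static String[] histogramRows(List<Integer> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("no scores");
        }
        int[] counts = new int[BINS];
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (int s : scores) {
            if (s < 0 || s >= BINS) {
                throw new IllegalArgumentException("score outside 0..100: " + s);
            }
            counts[s]++;
            sum += s;
            sumOfSquares += (double) s * s;
        }
        int n = scores.size();
        double mean = sum / n;
        double sd = Math.sqrt(Math.max(0.0, sumOfSquares / n - mean * mean));
        if (sd == 0.0) {
            throw new IllegalArgumentException("all scores are identical");
        }
        double beta = sd * Math.sqrt(6.0) / Math.PI;
        double mu = mean - 0.5772 * beta;

        String[] rows = new String[BINS];
        for (int x = 0; x < BINS; x++) {
            double z = (x - mu) / beta;
            double density = Math.exp(-(z + Math.exp(-z))) / beta; // Gumbel density at x
            rows[x] = String.format(Locale.ROOT, "%d\t%d\t%.6f\t%.6f",
                    x, counts[x], (double) counts[x] / n, density);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Integer> scores = Arrays.asList(3, 3, 5, 7); // made-up scores
        for (String row : histogramRows(scores)) {
            System.out.println(row);
        }
    }
}
```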
fileName - the name of a file to which the histogram will be output
v - a Vector of optimal pairwise alignment scores

public static void main(java.lang.String[] args)
The main method creates an optimal pairwise alignment for two genomic sequences.
The main method expects the Alignment program to be executed with exactly two command line arguments: the names of two FASTA files, each containing a genomic sequence. The main method creates an Alignment object based on the two genomic sequences and computes the optimal pairwise alignment of the sequences.
args - an array of Strings representing any command line arguments