public class KMeans_Clustering extends Clustering
The KMeans_Clustering class represents a clustering (i.e., grouping or partitioning) of a collection of genes using the k-means clustering method.
Fields inherited from class Clustering: clusters, experiments, genes
Constructor and Description |
---|
KMeans_Clustering(String fileName, int numClusters): Creates an initially empty KMeans_Clustering. |
Modifier and Type | Method and Description |
---|---|
boolean | assignGenesToClusters(java.util.Vector<java.util.Vector<Double>> means): Assigns each gene to the cluster whose mean expression vector is closest to the gene. |
java.util.Vector<java.util.Vector<Double>> | getMeansOfAllClusters(): Returns a collection of the mean (average) expression vectors for all of the clusters. |
void | initializeAllClusters(): Initializes each cluster so that each cluster contains zero genes. |
void | kMeans(): Performs k-means clustering. |
static void | main(String[] args): The main method creates a Clustering based on gene and experiment data from a tab-delimited text file. |
void | populateEmptyClusters(): If any clusters are empty (contain zero genes), then genes are moved from clusters containing multiple genes. |
void | randomlyAssignGenesToClusters(): Assigns each gene to a random cluster. |
Methods inherited from class Clustering: getExperimentNamesFromFile, getGeneInformationFromFile, getNumClusters, getNumExperiments, getNumGenes, toString
public KMeans_Clustering(String fileName, int numClusters)
Creates an initially empty KMeans_Clustering.
The set of genes and experiments is determined from the file named by the specified String.
Genes and experiments are read in from the tab-delimited file. Initially, the constructed KMeans_Clustering consists of the specified number of clusters, but no cluster has yet been assigned any genes.
Parameters:
fileName - the name of a tab-delimited text file containing gene and experiment data
numClusters - an integer representing the desired number of clusters
public void initializeAllClusters()
Initializes each cluster so that each cluster contains zero genes.
public java.util.Vector<java.util.Vector<Double>> getMeansOfAllClusters()
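Returns a collection of the mean (average) expression vectors for all of the clusters.
Returns:
a collection of Vectors such that each Vector in the collection represents a collection of expression values
public void populateEmptyClusters()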
Assumes all genes have been assigned to a cluster. For each empty cluster, a cluster containing multiple genes is chosen at random, and one gene is removed from it and added to the empty cluster. When the method completes, no cluster contains zero genes.
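The repopulation step described above can be sketched as follows. This is only an illustration under stated assumptions, not the class's actual code: genes are represented as plain Strings, clusters as nested Lists rather than the class's own fields, and the class name, method signature, and the explicit Random parameter are hypothetical. The sketch also assumes there are at least as many genes as clusters, since otherwise no donor cluster with multiple genes is guaranteed to exist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the documented populateEmptyClusters() behavior.
class PopulateEmptyClustersSketch {

    // For every empty cluster, pick clusters at random until one with at
    // least two genes is found, then move one of its genes (chosen at
    // random) into the empty cluster.
    static void populateEmptyClusters(List<List<String>> clusters, Random rng) {
        int totalGenes = 0;
        for (List<String> cluster : clusters) {
            totalGenes += cluster.size();
        }
        // Assumption: with fewer genes than clusters the search below
        // could never finish, so reject that input up front.
        if (totalGenes < clusters.size()) {
            throw new IllegalArgumentException("fewer genes than clusters");
        }
        for (List<String> cluster : clusters) {
            if (!cluster.isEmpty()) {
                continue;
            }
            List<String> donor;
            do {
                donor = clusters.get(rng.nextInt(clusters.size()));
            } while (donor.size() < 2);
            // Removing by index returns the removed gene.
            cluster.add(donor.remove(rng.nextInt(donor.size())));
        }
    }
}
```

Because a donor always keeps at least one gene, filling one empty cluster never creates another, which matches the stated postcondition that no cluster contains zero genes.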
public void randomlyAssignGenesToClusters()
Each gene is randomly assigned to one of the clusters. When the method completes, no cluster should contain zero genes.
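One way to meet the stated postcondition is sketched below. It is an assumption about a possible approach, not a description of how the class does it (the class may, for example, assign genes purely at random and then call populateEmptyClusters). Genes are represented as Strings, and the class name, method signature, and Random parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the documented randomlyAssignGenesToClusters().
class RandomAssignmentSketch {

    // Returns numClusters lists that together contain every gene exactly once.
    // The genes are shuffled, the first numClusters genes seed one cluster
    // each (so no cluster ends up empty), and every remaining gene goes to a
    // uniformly random cluster.
    static List<List<String>> randomlyAssign(List<String> genes, int numClusters, Random rng) {
        if (numClusters < 1 || genes.size() < numClusters) {
            throw new IllegalArgumentException("need at least one gene per cluster");
        }
        List<String> shuffled = new ArrayList<>(genes);
        Collections.shuffle(shuffled, rng);

        List<List<String>> clusters = new ArrayList<>();
        for (int c = 0; c < numClusters; c++) {
            clusters.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            int target = (i < numClusters) ? i : rng.nextInt(numClusters);
            clusters.get(target).add(shuffled.get(i));
        }
        return clusters;
    }
}
```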
public boolean assignGenesToClusters(java.util.Vector<java.util.Vector<Double>> means)
Assigns each gene to the cluster whose mean expression vector is closest to the gene.
The parameter means corresponds to a collection of Vectors (analogous to a 2D array). Each entry of means corresponds to the set of average expression values for a particular cluster.
The method returns true if the cluster assignments are not improving, i.e., we have achieved a locally optimal clustering. The method returns false if the cluster assignments are an improvement over previous cluster assignments, i.e., better clusters have been found.
The measure used for comparing one clustering to another is the sum of the distances of each gene from its cluster's mean.
Parameters:
means - a collection of Vectors where each Vector corresponds to a set of expression values
Returns:
a boolean indicating whether the cluster assignments have improved (false) or not (true)
public void kMeans()
Performs k-means clustering. Initially, genes are randomly assigned to clusters. Then, iteratively, the clustering is improved until a local optimum is reached (that is, until the clusterings are no longer improving). In each iteration, first the mean (average) expression vector is calculated for each cluster. Second, each gene is assigned to the cluster whose mean (average) expression vector is closest to the gene's expression vector.
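The iteration described above can be sketched as a small standalone program. This is a minimal sketch under several assumptions, not the class's implementation: expression vectors are plain double arrays rather than Vectors, the distance is assumed to be Euclidean (the documentation does not name the distance measure), the initial assignment is passed in rather than drawn at random, and all class and method names are hypothetical. The stopping rule follows the documented measure: the loop ends once the summed distance of each gene from its cluster's mean stops decreasing.

```java
import java.util.Arrays;

// Hypothetical standalone illustration of the documented k-means loop.
class KMeansLoopSketch {

    // Euclidean distance between two expression vectors of equal length
    // (an assumed choice of distance).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Mean expression vector of each cluster; assignment[g] is the cluster
    // of gene g. An empty cluster gets an infinitely distant mean so that no
    // gene is moved into it; the documented class instead offers
    // populateEmptyClusters() for that situation.
    static double[][] means(double[][] genes, int[] assignment, int k) {
        int dims = genes[0].length;
        double[][] means = new double[k][dims];
        int[] counts = new int[k];
        for (int g = 0; g < genes.length; g++) {
            counts[assignment[g]]++;
            for (int d = 0; d < dims; d++) {
                means[assignment[g]][d] += genes[g][d];
            }
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                Arrays.fill(means[c], Double.POSITIVE_INFINITY);
                continue;
            }
            for (int d = 0; d < dims; d++) {
                means[c][d] /= counts[c];
            }
        }
        return means;
    }

    // Moves each gene to its nearest mean (ties go to the lowest cluster
    // index) and returns the sum of the distances to the chosen means.
    static double assignToNearest(double[][] genes, double[][] means, int[] assignment) {
        double total = 0.0;
        for (int g = 0; g < genes.length; g++) {
            int best = 0;
            double bestDist = distance(genes[g], means[0]);
            for (int c = 1; c < means.length; c++) {
                double dist = distance(genes[g], means[c]);
                if (dist < bestDist) {
                    best = c;
                    bestDist = dist;
                }
            }
            assignment[g] = best;
            total += bestDist;
        }
        return total;
    }

    // Repeats "compute means, reassign genes" until the summed distance no
    // longer improves, then returns the final cluster index of every gene.
    static int[] kMeans(double[][] genes, int[] initialAssignment, int k) {
        int[] assignment = initialAssignment.clone();
        double previous = Double.POSITIVE_INFINITY;
        while (true) {
            double[][] currentMeans = means(genes, assignment, k);
            double total = assignToNearest(genes, currentMeans, assignment);
            if (!(total < previous)) {
                return assignment;
            }
            previous = total;
        }
    }
}
```

For instance, with the four genes (0, 0), (0, 1), (10, 10), (10, 11), two clusters, and the deliberately poor starting assignment {0, 1, 0, 1}, the loop settles on {0, 0, 1, 1}. As with any k-means run, the result is only a local optimum and depends on the starting assignment.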
public static void main(String[] args)
The main method creates a Clustering based on gene and experiment data from a tab-delimited text file.
The clustering is determined using k-means clustering. The text file and the desired number of clusters are specified as command line arguments. The computed set of clusters is written to standard output (and can be redirected to a file).
Parameters:
args - an array of Strings representing any command line arguments