What is Clustering ?

    Clustering is the process of deciding which genes behave similarly and should be thus considered members of a cluster of genes, based on their observed expression profiles.
    Clustering arbitrary data into clusters of similar items presents the difficulty of deciding what constitutes a good clustering. It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user which must supply this criterion, in such a way that the result of the clustering will suit their needs.
    In the particular case of gene expression clustering, the desirable outcome of clustering is identification of functional groups of genes, i.e., identification of clusters of genes involved in common biological processes.

ArrayMiner uses two intuitively appealing criteria of clustering quality, namely clustering for minimal total variance of the clusters or a maximum probability of a Gaussian mixture model of the data.

    It may seem that defining a good criterion for clustering quality would be sufficient for obtaining good clusters of genes. However, this is very far from being the case: finding clusters of genes of minimal total variance or maximal Gaussian probability is not easy, and not all tools are equally good at the task.

    In fact, clustering for minimal total variance or maximal Gaussian probability is a provably difficult computational problem (for those well versed with computer science: the problem is NP-Hard). This implies in particular, that fast one-shot heuristics, such as the popular method of k-Means (Web Based), simply cannot be expected to perform well on this task. As a result, they will supply clusterings of poor quality with high probability.

    For the biologist who runs the clustering software, the quality of the clustering is of significant importance, as he or she interprets the clusters as associations of genes that behave similarly. Hence considering a clustering of poor quality will lead the biologist into a painful examination (and, hopefully, rejection) of hypotheses purportedly explaining the bogus associations suggested by the ill formed clusters in that clustering, causing a potentially serious waste of time and resources. Conversely, a poor clustering obviously means that better ones are not supplied. It thus prevents the biologist from examining the probably more useful associations which would be otherwise suggested by a high-quality clustering. This may cause the biologist to miss important biological phenomena, a potentially serious hindrance on their research.

Following the above observations, ArrayMiner puts a premium on the quality of the clusterings it supplies. Indeed, we are convinced that it is far better for the biologist to wait a few minutes for a high-quality clustering that will constitute a solid and reliable basis for their research, than to get a poor solution in seconds and then waste months of futile effort researching the bogus associations of genes suggested by the poor clustering.
To achieve the goal of high quality, ArrayMiner relies on the proprietary technique of Grouping Genetic Algorithms (GGA). Developed by Optimal Design over the last decade specifically for clustering problems, and fine-tuned within ArrayMiner for the specific task of clustering of gene expression profiles, the GGA supplies clusterings of excellent quality within reasonable execution times. The GGA is a special variety of the Genetic Algorithm (Web Based) technique.

The difficulty of the expression profile clustering problem has one practical consequence: it is impossible to know with certainty whether a given clustering is the best possible one. The only way to achieve that certainty would be to examine all possible clusterings there could be, yet that is impossible to do, as there are billions of them for even modest numbers of expression profiles. Since ArrayMiner strives to find the best clustering of the profiles, it will always try to improve the one it currently has, even if it is the best possible one. Several options are available for deciding when to stop the algorithm and adopt the currently available clustering as the final one.

It is worthy noting that the problem of deciding when to stop the optimization process is irrelevant in most other clustering software. The reason of this is that the clustering methods built into that software are in most, if not all, cases simple heuristics (e.g. the k-Means method), rather than a well defined optimization procedure like the GGA. Those methods are therefore simply stopped when they “go out of steam” and are unable to search for better clusterings. The obvious flip side is that they often end up with poor quality clusterings, with the potentially serious consequences to the user discussed above.