In 2003 the Human Genome Project was completed, mapping the entire human genome. The project was started in 1990 and was estimated to take some 30-40 years to complete. What the initial predictions missed was that DNA sequencing was subject to what Kurzweil refers to as the law of accelerating returns. The power of DNA sequencing has been increasing exponentially while the cost of the sequencing has been falling exponentially. Thus the project was completed much earlier than expected.
The Human Genome Project is however not the only completed full genome mapping. Thanks to the fast and cheap sequencing a wide range of other organisms have had their DNA fully sequenced. In this second part of this tutorial we will look at yeast and how its genes control its metabolic processes. We will use DNA microarray data to look at this problem.
In the first part of the tutorial we introduced competitive algorithms and focused on the Self-Organizing Map (SOM). We showed how it can be used to visualize high-dimensional data in a very intuitive way.
In this part we will move on to meta-clustering: how and why to cluster a SOM using a neural gas. You will also familiarize yourself with the unified distance matrix (U-Matrix). We will then move on to the actual problem involving the microarray data. After we have done our clustering with the SOM Visualizer we will create a standard neural network based classifier based on the results.
If you haven't gone through the first part, it is strongly recommended that you do so before proceeding.
In the previous part we looked at how a SOM wraps around the data to cover it topologically. Specifically, you might recall this animation:
Last time we looked at visualization but now we are interested in having the data automatically divided into clusters. If I manually do it, this might arguably be the result:
I'm saying "arguably" because there are many ways of actually saying what constitutes a cluster. Take a look at the data in the lower left corner. Is that one cluster or two? Your opinion is as good as mine - there is no absolute criterion. We'll return to this problem a bit later. For now we can see that the SOM doesn't do a very good job at the precise clustering. We clearly can't take each unit and call it a cluster:
At this point we can do a number of things in our quest of producing usable clusters out of the SOM organization. One way is essentially to do the same thing all over again - use an adaptive algorithm to cluster the results from the SOM. We'll use a neural gas to cluster the SOM nodes. We could run it on the data directly, but this is faster as the SOM has done most of the dirty work already. While some competitive algorithms don't like to get clustered, the SOM is a pretty good sport about it. For the task, we'll use its cousin, the neural gas algorithm. A neural gas is very similar to a SOM, except it literally has no strings attached. The units are free to move in any direction regardless of the other units.
Once we have clustered the SOM units with the gas we can use some clever algorithms to eliminate superfluous units. We do that by merging units that are too close to each other into a common cluster. The principle is that we start off with a large number of units in the gas and then reduce that number.
This gives an automatic method of clustering: First the SOM clusters the data, then the neural gas clusters the SOM and finally the gas units are post processed and merged into clusters.
So how exactly do we merge the gas units? Well, we have two things to work with. The first one relates to the principle that intra-cluster distance (within the cluster) should be small and inter-cluster distance (between the clusters) should be large. That way we can devise a relative metric of what constitutes a good cluster. The second clue we get from the SOM through the use of a visualization technique called the U-matrix or unified distance matrix. It visualizes the distance between adjacent units in the SOM.
Recall the maplets from the first tutorial:
Now instead of the color representing the value of the unit for a particular variable, imagine if we instead showed the average distance of that unit to other units. You would end up with something like this:
It might sound odd at first, but it is really simple. The dark areas are where the distance between the nodes is large - it's a gap. The light areas are where the nodes are close to each other. So we can read the U-matrix and relatively easy see that there are three white areas separated by a dark region. You can think of it as bits of land separated by a ditch. In the case above there are hence three clusters covered by the SOM. The U-matrix is an important clue when you try to determine how many natural clusters the SOM is covering.
The final decision of what constitutes a cluster is a combination of the two techniques.
Now you should understand how to read and interpret a SOM displayed like this one in Synapse:
As an introduction to the real-world problem we are going tackle, I'll shamelessly borrow the text from Wikipedia:
|A DNA microarray is a collection of microscopic DNA spots, commonly representing single genes, arrayed on a solid surface by covalent attachment to chemically suitable matrices. DNA arrays are different from other types of microarray only in that they either measure DNA or use DNA as part of its detection system.Microarray technology is often used for gene expression profiling. It makes use of the sequence resources created by the genome sequencing projects and other sequencing efforts to answer the question, what genes are expressed in a particular cell type of an organism, at a particular time, under particular conditions? For instance, they allow comparison of gene expression between normal and diseased (e.g., cancerous) cells. There are several names for this technology - DNA microarrays, DNA arrays, DNA chips, gene chips, others. Sometimes a distinction is made between these names but in fact they are all synonyms as there are no standard definitions for which type of microarray technology should be called by which name.|
If you don't understand parts of that, don't worry, it doesn't matter. The problem is fairly general in nature and can be understood without having intimate knowledge of the science and technology behind it.
The core of the problem is simple - you have a number of DNA samples and would like to know how the genes operate together. Are there groups of genes that perform some task? Which genes are related to which other genes? How does the genetic information tie into effects on the phenotype (the organism)?
These questions can be answered through clustering, classification or a combination of both.
Meet the Yeast
The data we're going to look at deals with something as exotic as gene expressions involved in yeast metabolism. Specifically the data deals with something called diauxic shift.
(Mrs. Saccharomyces cerevisiae - yeast)
When inserted into a medium with glucose, the yeast can convert the glucose to ethanol. The yeast runs on ethanol as the preferred direct power source. The shift from the fermentation of glucose to the respiration of ethanol is the diauxic shift. The process involves major changes in gene expressions as the metabolic pathways are turned on and off.
The data set which can be downloaded here has 834 rows, each one representing the expression levels of a gene. It has 8 columns, seven (T0..T6) which represent different times in the process and one column with gene designations. So we have a data set that describes the expression levels of different genes at 7 different times in the diauxic shift process. The expression levels are a measure of how active the genes are.
Here is a Synapse solution with the data set loaded. Note that the gene names have been enumerated by the CSV format that loads the data. This column with gene designations is for observation only - in case we later want to see which genes are in which cluster. (For people with bioinformatic tendencies, "Gene designation" is actually the ORF - open reading frame.)
Apparently, if one is to follow the example of the general media, it is not possible today to mention some scientific research without giving an example of a practical use for it. So here it goes: the genetic background of the metabolic system of this particular yeast (S. cerevisiae) is important for medical research. Understanding of the genetic level of the metabolism has made it possible to design medicines such as better hepatitis B vaccines.
Ok, let's do some clustering. In Synapse we drop the data into a SOM Visualizer. Before we start the clustering, we must make sure that we don't use the gene designation ("GENE" column) in the clustering. We deactivate it in the "Feature configuration" dialog of the visualizer:
Now to the fun part - if we press the play button we get something like this:
We have exactly two well-defined clusters. What is even better is that we can see directly what it means. The map has in a beautiful way sorted the data so that it almost looks as if you have seven frames of an animation of the genetic expressions over time. On the actual DNA or the microarray, it looks of course nothing like that, but here in pure feature space ordered by a SOM we have an elegant overview of the process, step by step.
If we select the red cluster we can see how it behaves across T0-T6 (the cells with the white borders are the selected ones):
We can clearly see that the red cluster (= Cluster 1) is completely active at T0, T1 and T2 (red=high activity) and not at all at T5 and T6 (blue=low activity). Those are the genes responsible for burning the glucose. The blue cluster (=Cluster 0) contains on the other hand the alcoholic genes (that use the ethanol). The metabolic switch can be clearly seen in T2-T4 where genes from both clusters play their part.
We could at this point for instance select the units that are active during T2-T4 and connect a new SOM to the selection. That way we could cluster just the genes active during the transition. But, as the process is identical to the one we just went through, I'll leave that to the interested reader to do.
Building a classifier
Instead we're going to convert the SOM into a usable classifier. First we convert the SOM into a filter that will end up on the filter stack. In the "Options" menu select "Apply as filter". Two new features have been created - "Cluster 0" and "Cluster 1" which are generated by the SOM that now lives its life as a preprocessing filter. Since we've deduced what our clusters are (Glucose processing and Ethanol consumption) we can rename the generic cluster names to "Glucose" (Cluster 1) and "Ethanol" (Cluster 0) using the rename filter.
You can download a Synapse solution here that contains what we have done so far (SOM clustering filter and renaming).
We now move off to design mode to build a supervised classifier. We could use a Support Vector Machine, but that would be overkill. A Bayesian classifier would most likely do, but we'll go with a simple one layer neural network:
Input features are T0, T1, T2, T3, T4, T5, T6 and output features are Glucose, Ethanol. The "GENE" feature is discarded as it contains no relevant information for the classification. (Synapse solution here)
We configure the delta terminator to show a confusion matrix and go to training and run the system:
100% correct classification on the validation set is obviously as good as it can be, so there is little to be said. What we now have is a glucose/ethanol gene classifier. Input the expression values at the different times and it will tell you if the gene is responsible for the glucose fermentation or the ethanol processing. (Synapse solution here)
Self-Organizing maps are powerful adaptive algorithms that can be used both for visualizing and clustering data. In this tutorial we have looked at an example of how SOMs can be used in bioinformatics - specifically with DNA microarray data. We created a clustering model whose results we used to train a simple neural network as a classifier. Given a gene's expression at different times the classifier could decide if the gene was involved in the glucose fermentation process or the ethanol respiration.
Reverse-engineering something that has evolved along weird and unpredictable paths for billions of years is not trivial. However, with computers and the right algorithms as well as tools such as the DNA microarrays it is becoming possible. In the past we had to rely on models that we could deal with directly with our minds. In the mechanic model of the universe reverse engineering meant taking something apart to its basic parts and to get a model in your mind of how the parts work together.
The late Douglas Adams expressed it well: "If you try and take a cat apart to see how it works, the first thing you have on your hands is a non-working cat. Life is a level of complexity that almost lies outside our vision."
With the computers and a bit of clever algorithms we can actually reach, process and understand that level of complexity - without necessarily breaking the cat. In this example, the computer with Synapse on it clustered a seven dimensional data set; something a human would have great difficulty with. The computer couldn't have cared less if there were 2 or 2,000 dimensions. What we can accomplish today is mind staggering - and we have barely begun to scratch the surface. Living things are no longer at an unreachable complexity level. Although bioinformatics is in its infancy there are essentially no real obstacles in the way anymore.
It's a brave new world and it's really rewarding to be able to contribute on the computation side of things. As these methods are relatively new (at least compared to the traditional math models), not all that many people are aware of them yet. The great thing about that is that often you get to demonstrate one of these methods to someone and then they go off and solve some previously unsolvable real-world problem that you would never have thought of.
At this point, if you haven't already, you might want to download Synapse and try out the SOMs on your own data. Other general tutorials on adaptive systems can be found on the blog and Synapse specific, more practical tutorials can be found here.
Luka Crnkovic-Dodig / Peltarion