Last year saw the 30th anniversary of Richard Dawkins' famous book The Selfish Gene, the book that presented gene-centric evolution to the general public. Controversial at the time, the gene-centric view it describes is today widely accepted and covers the connection between genetics and evolution through natural selection.
Dawkins' selfish gene should have the emphasis on gene - not selfish - as the primary point is that the gene is the basic unit that evolution through natural selection operates on. The selfish part is directly related to natural selection - genes that maximize their survival probabilities (which they do among other things through cooperation with other genes) live on in the gene pool while those less fit go extinct.
The title of this tutorial should be read in a different way: The Self-Organized Gene - emphasis on the self-organized part. We are not going to be discussing the properties of the gene itself, but how gene functions can be analyzed using an adaptive method called self-organization. Our basic unit of operation won't be the gene, but the artificial neuron. In a way, there is a connection to the selfish part as well. While our units do not fall victim to natural selection, do not mutate and are not replicated, they do compete and interact, which gives rise to the emergent global property of self-organization.
In more practical terms, in this tutorial we are going to explore unsupervised clustering of data. We will apply this to DNA microarrays, an exciting new technology that allows for very rapid expression profiling. We will see how, using adaptive self-organizing methods, we can detect patterns in microarray data that can be used for understanding, detecting and fighting diseases caused by genetic factors.
In the first part of this tutorial we shall familiarize ourselves with the basic concepts of unsupervised learning, competitive learning and self organization. We shall also explore the self-organizing map as a powerful visualization tool and we'll take a look at a few simple examples to illustrate the principles.
In the second part of the tutorial we will cover meta-clustering before we move on to our target: the analysis of DNA microarray data. Once we have done that we will see how we can change our unsupervised clustering system into a supervised classification system in general and specifically in Synapse.
In previous tutorials we have covered supervised learning in various forms. One typical problem that is tackled with supervised algorithms is classification. Here we shall take a look at classification's unsupervised cousin - clustering.
So what exactly are supervised and unsupervised learning? Supervised learning (also called "learning with a teacher") is when we train an adaptive system by giving it examples of what we want it to do. In essence we present it with an input and tell it what output we would like to see associated with that input. For instance we show the system a picture of an apple or a banana and tell it what it is. After being shown a number of apples and bananas (and told what they are), the system will hopefully be able to distinguish between them when shown a new batch of pictures of apples and bananas.
Unsupervised learning is a different animal. Instead of teaching the system by example we just unload data on it and let the system itself sort it out. Instead of saying "this is an apple" and "this is a banana", we just show the pictures and let the algorithm work out that there are in fact two types of entities (clusters) in the data.
So how could the system possibly know what constitutes two "different" clusters? It is not that difficult - as a matter of fact, it is implied in the question. Different means that there is difference - or distance, in geometric terms. Let me illustrate, using our apples and bananas example. Suppose we have the fruit described by two features: shape and color. The shape variable varies from round to elongated while the color variable varies from yellow to green. Imagine now that we have a few apples and bananas as data points and plot them in a graph:
Even without knowing what bananas and apples are supposed to look like, and although there is variation in shape and color, it is quite easy to see that there are two separate clusters in the data. We are able to partition the data by analyzing relative distances in the plot. There are two relevant distance measures - intra cluster and inter cluster distances. The former refers to the distances between points within a cluster and the latter to the distance between clusters. In the following plot we can see the intra cluster distances marked with red and inter cluster distances with blue:
Ultimately, a successful clustering depends on minimizing the intra cluster distances and maximizing the inter cluster distances. Given how simple this looks, what is the big deal?
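To make the two distance measures concrete, here is a minimal sketch in Python. The fruit coordinates are made up for illustration; all that matters is that the shape/color values form two separated groups:

```python
import numpy as np

# Two toy clusters in a 2D feature space (shape, color); values are hypothetical.
cluster_a = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25]])   # "apples"
cluster_b = np.array([[0.8, 0.9], [0.9, 0.8], [0.85, 0.95]])   # "bananas"

def mean_intra_distance(points):
    """Average pairwise distance between points within one cluster."""
    dists = [np.linalg.norm(p - q)
             for i, p in enumerate(points)
             for q in points[i + 1:]]
    return float(np.mean(dists))

def inter_distance(a, b):
    """Distance between the two cluster centroids."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

intra = max(mean_intra_distance(cluster_a), mean_intra_distance(cluster_b))
inter = inter_distance(cluster_a, cluster_b)
print(intra < inter)  # prints True: a good clustering keeps intra << inter
```

The same two numbers generalize directly to any number of features, since the norm is computed over the full feature vector.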
Well, in this case we used two variables - shape and color. This is a two dimensional feature space. We could increase that to 3D by adding a feature (variable) and we would still be able to distinguish the clusters with relative ease (although it might be difficult in practice on a 2D computer screen, if you can't interactively rotate the plot). For us humans it becomes impossible once we go beyond the three dimensions that our brain can visualize. Adaptive clustering algorithms have no such problems - two or a hundred dimensions, it's all the same to them.
Feature space is a term that we will come back to many times in this text, so it needs to be demystified. Feature equals variable. If a data sample has three parameters - for example width, height and length - then the point is in a three dimensional feature space. That space has three axes: width, height and length. Any point in that space is described by those three features.
So how do these adaptive algorithms work? The basic principle behind these systems is called "competitive learning" and is quite different from other neural network algorithms. The basic unit of operation that we shall familiarize ourselves with is most commonly known as an "artificial neuron" which is a very misleading name as it has nothing at all in common with either biological neurons or the artificial neurons found in regular artificial neural networks. A more suitable name would perhaps be "artificial particle" or "agent", but in this text we shall use another name that is also sometimes used in the literature: "unit". While it may be a very non-descriptive term, it is better than the misleading "artificial neuron" and more convenient than making up new names for it.
So what are these mysterious "units"? Well, essentially you could say that they are a special type of data point. The big difference between them and the regular data points is that the units are dynamic - they have active behavior and change their position based on rules and algorithms. The simplest type of unit, the basic competitive unit, can be seen as a lonely data point seeking the company of other data points. It is attracted to data, and the nearer another data point lies, the more strongly the unit is attracted to it and moves towards it. You may see where this is going - if a unit is close to a cluster of data points, it will gravitate towards it. Other, more distant clusters will attract it as well, but with a much weaker force.
For the mathematically inclined, the change in the position of the unit for each data point can be expressed as follows:
Δp = a(x - p)d(p,x)

The vector p is the position of the unit and the vector x is the position of the data point. The "a" is a factor called the "learning rate" and it regulates how fast the unit will move towards the data point. The function d(p,x) is a distance scaling function - the larger the distance between p and x, the smaller d(p,x) will be. Expressed in plain English, it would be something like this: the change in position of the unit equals a learning rate multiplied by the difference between the positions of the data point and the unit, scaled by a function of the distance between the unit and the data point.
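For concreteness, here is a one-step sketch of the rule in Python. The Gaussian form chosen for d(p,x) is an assumption - the text only requires that it shrinks as the distance grows. Note that the unit moves toward the data point, so the update uses the difference x - p:

```python
import numpy as np

def update_unit(p, x, lr=0.1, sigma=1.0):
    """One competitive-learning step: move unit p toward data point x.
    The pull weakens with distance via a Gaussian scaling (an assumed
    choice of d; any function that shrinks with distance would do)."""
    dist = np.linalg.norm(x - p)
    d = np.exp(-(dist ** 2) / (2 * sigma ** 2))   # d(p, x): large when close
    return p + lr * (x - p) * d                   # delta-p = a * (x - p) * d(p, x)

p = np.array([0.0, 0.0])      # unit position
x = np.array([1.0, 1.0])      # data point
p_new = update_unit(p, x)
# p_new lies between p and x: the unit has moved toward x, but not all the way
```

Running this repeatedly over a stream of data points makes the unit settle near the center of mass of the data closest to it.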
Mathematics aside, a visual demonstration never hurts. Observe the animated gif below: the small black dots are data points while the large green dots are the units. Look how they move:
(For the fully interactive Java applet this was recorded from see this.)
Simple, yet efficient: we just sprinkle the feature space with units and let them move around. After a while, the positions of the units will be representative of cluster centers.
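The whole procedure can be sketched in a few lines of Python. This sketch uses the winner-take-all variant, where only the nearest unit moves for each data point (the soft, distance-scaled pull described above behaves similarly); the clusters, seeds and learning rate are all made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical clusters of data points in a 2D feature space.
data = np.vstack([rng.normal([0.0, 0.0], 0.1, (50, 2)),
                  rng.normal([1.0, 1.0], 0.1, (50, 2))])

# "Sprinkle" units over feature space; seeding each unit on a data point
# (here one from each half, to keep the sketch deterministic) avoids
# units that never win and therefore never move.
units = np.vstack([data[0], data[50]]).copy()

lr = 0.05
for epoch in range(100):
    for x in rng.permutation(data):
        # winner-take-all: only the nearest unit moves toward the data point
        winner = np.argmin(np.linalg.norm(units - x, axis=1))
        units[winner] += lr * (x - units[winner])

# after training, each unit sits near one cluster center
```

The final unit positions are the cluster centers; classifying a new point is then just a matter of finding its nearest unit.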
The time honored tradition in the field of neural networks is that for each slight modification of an existing algorithm, you give it a brand new flashy name. Competitive learning is no exception - there are tons of minor varieties that go under different names, but they are all essentially based on the same basic principle of units moving in the general direction of nearby data points.
There is however one clever variety that we will be focusing on.
The Self-Organizing Map
The self-organizing map (SOM) is also known as a Kohonen map, named after its inventor Teuvo Kohonen. It is based on competitive learning and we have our familiar "units" that seek companionship of other data points. There is however a twist to this story: the units are no longer unbound agents free to roam feature space as they wish. Instead they are all interconnected in a grid. When a unit tries to run away in a direction, it will be pulled back by the strings that are attached to neighboring units in the grid.
One of the problems with plain competitive learning is that once the units have converged to their final positions, all you can get is the coordinates of the cluster centers - you still have no real idea of what feature space looks like or what the relationships between the different clusters are. By tethering the units together in a grid, we can force them to keep formation. One unit will have some form of relationship in feature space to the units next to it in the grid.
Time for an animated example - the premise is the same as before: black dots= data, green dots = units:
You can see how, while the map gets deformed as the individual units run off in search of data points, it still keeps a global structure. A unit's neighbors remain its neighbors after adaptation. In the animated example the units are connected in a rectangular grid, but other forms are possible. Later, when we come to the visualization examples, we will actually use a hexagonal grid. But before we explore the usefulness of the SOM, we'll go through a bit more theory.
It is important to understand that while the map itself is two dimensional (i.e. a grid), the units in the map have the same number of dimensions as the feature space (the number of features). Here is an example of a SOM being stretched across a three dimensional feature space (features x,y,z):
As you can see, the map is still a plane, but each point in it has a coordinate with three dimensions. As was mentioned earlier, these maps become useful when we have more features. Although it isn't possible to directly visualize it, the principle remains the same when the number of dimensions increases.
The usefulness of the SOM stems from the grid connections between the units. The grid ensures that the topological properties of the data in feature space are preserved: we get a map of how the features (variables) are connected in the data. The map is then used for visualization.
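A minimal SOM training loop might look like the sketch below. This is not Synapse's implementation - the grid size, learning-rate schedule and neighborhood schedule are all assumed values. The key detail is that the neighborhood pull falls off with distance measured on the grid, not in feature space; that is what preserves topology:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 points in a 2D feature space.
data = rng.uniform(0, 1, (200, 2))

# A 3x3 grid of units; each unit has fixed grid coordinates and a
# weight vector that lives in feature space.
rows, cols = 3, 3
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
weights = rng.uniform(0, 1, (rows * cols, 2))

def train_som(weights, data, epochs=50, lr0=0.5, sigma0=1.5):
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                  # decaying learning rate
        sigma = max(sigma0 * (1 - epoch / epochs), 0.3)  # shrinking neighborhood
        for x in rng.permutation(data):
            # 1. find the best-matching unit (BMU) in FEATURE space
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # 2. pull the BMU and its neighbors toward x; the pull decays
            #    with distance on the GRID, which keeps neighbors together
            grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)
            h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights

weights = train_som(weights, data)
# the 3x3 grid is now stretched across the data in feature space
```

Shrinking the neighborhood radius over time first unfolds the map globally and then lets individual units fine-tune their positions locally.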
Visualization with the SOM
This section might seem a bit complicated at first, but it is in fact quite simple when you look at it from the right angle. We are going to take it step by step because it is important for the rest of the tutorial for you to understand how a SOM is "read".
Suppose we have a 3x3 SOM (i.e. 9 units in a square grid) that has been adapted and twisted to fit a 2D data set. It could look something like the left part of this picture:
All the units in the grid on the left side of the picture have been assigned numbers. On the right side there is a 3x3 grid with each cell corresponding to one unit in the SOM. Reading from the graph on the left, we will fill out the boxes on the right in a meaningful way. We'll call the table on the right side a "maplet".
The feature-space that the SOM above occupies has two dimensions: X and Y. Suppose we created one maplet for X and one for Y and filled each cell with a color that corresponds to the value of the unit associated with the cell (for the corresponding feature):
Red means high value and blue low value.
Let's go through it once more: Each cell corresponds to a unit. In the "X" maplet the cell colors correspond to the X value for each unit and the "Y" maplet the same goes for the Y value. If we go back to the left side plot in the previous figure and look for instance at units 1 and 9:
Unit 1: Low X value, High Y value => X Maplet, Cell 1 Blue, Y Maplet, Cell 1 Red
Unit 9: High X value, Low Y value => X Maplet, Cell 9 Red, Y Maplet, Cell 9 Blue
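In code, building the maplets from a trained map amounts to slicing the unit weight vectors one feature at a time. The weight values below are hypothetical, chosen to mirror the 3x3 example above:

```python
import numpy as np

# Hypothetical weight vectors for a trained 3x3 SOM. Each row is one
# unit's (X, Y) position in feature space, units numbered 1..9 row by row.
weights = np.array([
    [0.1, 0.9], [0.5, 0.9], [0.9, 0.9],   # units 1-3
    [0.1, 0.5], [0.5, 0.5], [0.9, 0.5],   # units 4-6
    [0.1, 0.1], [0.5, 0.1], [0.9, 0.1],   # units 7-9
])

def maplet(weights, feature, rows=3, cols=3):
    """One maplet = the values of a single feature laid out on the grid."""
    return weights[:, feature].reshape(rows, cols)

x_maplet = maplet(weights, 0)   # color each cell by the unit's X value
y_maplet = maplet(weights, 1)   # ... and by its Y value

# Unit 1 (top-left cell):     low X ("blue"), high Y ("red")
# Unit 9 (bottom-right cell): high X ("red"), low Y ("blue")
```

Coloring each cell by its value (red for high, blue for low) reproduces the X and Y maplets described above.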
This is all there is to it. While the example above is underwhelming and was used just to illustrate the principle, this method of visualization is very powerful, as we shall see. The SOM creates a map in the true sense of the word: a map of the data and of the feature space. Due to the connections between the units, the map is topologically intact, meaning that you will get sorted, continuous values on the map.
You can ignore the two first maplets for now - we'll deal with them in the second part of this tutorial. You might notice that this SOM has a hexagonal grid - it's better for visualization as it allows smoother transitions between the unit cells. The black circles inside the unit cells indicate how many data points are close to a particular unit. Larger circle means more data. This is a 15x15 grid in a 10-dimensional feature space.
So what can it tell us, just by glancing at it? Well, if we for instance look at the Churches feature and focus on the red area we can make the following conclusion by looking at the same unit cells for other features:
Where there are many churches per capita, there are few schools per capita, many cows, a large population and a fairly low median age.
The screenshot is of the SOM View visualizer in Synapse. In Synapse it is of course interactive and can be hooked up to other visualizers in the usual way. One interesting thing to explore is hierarchical SOM visualization - using a secondary (or tertiary, etc.) visualizer on a selection.
If you wish to try this in Synapse (recommended as interactivity in the data exploring means a lot), I would recommend that you start with tutorial 1 and tutorial 2 so that you learn the basics. If you don't have Synapse, you can download it here.
Conclusion Part 1
In this first part of the tutorial we have seen how competitive algorithms in general, and the self-organizing map (SOM) in particular, can be used to cluster data autonomously. We've also seen how the SOM can be used to visualize multidimensional data by creating a map of the feature space.
In the second part we will be looking at how you can extract clusters from a SOM by applying a second competitive algorithm on it. We will then proceed with the DNA microarray example and see how the SOM can automatically detect meaningful genetic groupings. Finally, we will show how you can proceed to create a supervised classifier based on the clustering that the unsupervised algorithm has provided.
Luka Crnkovic-Dodig / Peltarion