In 2003 the Human Genome Project was completed, mapping the entire human genome. The project was started in 1990 and was initially estimated to take some 30-40 years to complete. What the initial predictions missed was that DNA sequencing is subject to what Kurzweil calls the law of accelerating returns: the power of DNA sequencing has been increasing exponentially while its cost has been falling exponentially. Thus the project was completed much earlier than expected.
The Human Genome Project is, however, not the only completed full-genome mapping. Thanks to fast and cheap sequencing, a wide range of other organisms have had their DNA fully sequenced. In this second part of the tutorial we will look at yeast and how its genes control its metabolic processes. We will use DNA microarray data to study this problem.
In the first part of the tutorial we introduced competitive algorithms and focused on the Self-Organizing Map (SOM). We showed how it can be used to visualize high-dimensional data in a very intuitive way.
In this part we will move on to meta-clustering: how and why to cluster a SOM using a neural gas. You will also familiarize yourself with the unified distance matrix (U-Matrix). We will then move on to the actual problem involving the microarray data. Once we have done our clustering with the SOM Visualizer, we will build a standard neural network classifier from the results.
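To give a feel for what the U-Matrix measures before we get to the SOM Visualizer, here is a minimal sketch in plain Python/NumPy (the array shapes and the `u_matrix` helper are our own illustration, not part of Synapse): each cell holds the average distance between a node's weight vector and those of its immediate grid neighbors, so high values mark borders between clusters on the map.

```python
import numpy as np

def u_matrix(codebook):
    """U-Matrix of a rectangular SOM.

    codebook: array of shape (rows, cols, dim) holding the weight
    vector of each map node. Returns an array of shape (rows, cols)
    where each entry is the average distance from a node to its
    4-connected grid neighbors.
    """
    rows, cols, _ = codebook.shape
    um = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(np.linalg.norm(codebook[r, c] - codebook[nr, nc]))
            um[r, c] = np.mean(dists)
    return um
```

Low values indicate nodes lying inside a dense cluster; high values indicate large jumps in weight space, i.e. the walls between clusters that meta-clustering will later carve the map along.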
If you haven't gone through the first part, it is strongly recommended that you do so before proceeding.
Last year saw the 30th anniversary of Richard Dawkins' famous book The Selfish Gene, the book that introduced gene-centric evolution to the general public. Controversial at the time, the gene-centric view is today widely accepted, covering the connection between genetics and evolution through natural selection.
Dawkins' selfish gene should have the emphasis on gene - not selfish - as the primary point is that the gene is the basic unit on which evolution through natural selection operates. The selfish part relates directly to natural selection: genes that maximize their survival probability (which they do, among other things, through cooperation with other genes) live on in the gene pool, while those less fit go extinct.
The title of this tutorial should be read in a different way: The Self-Organized Gene - emphasis on the self-organized part. We are not going to discuss the properties of the gene itself, but how gene functions can be analyzed using an adaptive method called self-organization. Our basic unit of operation won't be the gene, but the artificial neuron. In a way, there is a connection to the selfish part as well. While our units do not fall victim to natural selection, do not mutate and are not replicated, they do compete and interact, which gives rise to the emergent global property of self-organization.
In more practical terms, in this tutorial we are going to explore unsupervised clustering of data. We will apply it to DNA microarrays, an exciting new technology that allows for very rapid expression profiling. We will see how, using adaptive self-organizing methods, we can detect patterns in microarray data that can be used for understanding, detecting and fighting diseases caused by genetic factors.
In the first part of this tutorial we shall familiarize ourselves with the basic concepts of unsupervised learning, competitive learning and self-organization. We shall also explore the self-organizing map as a powerful visualization tool, and we'll take a look at a few simple examples to illustrate the principles.
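As a taste of what is to come, here is a minimal sketch of the competitive learning at the heart of a SOM, written in plain Python/NumPy rather than in Synapse (the grid size, learning rate and decay schedule are arbitrary choices for illustration): each input triggers a competition, the winner and its map neighbors adapt, and order emerges on the grid without any supervision.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small 10x10 SOM with 3-dimensional weight vectors (e.g. RGB colors).
rows, cols, dim = 10, 10, 3
weights = rng.random((rows, cols, dim))
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

def train_step(x, lr, sigma):
    """One competitive learning step on input vector x."""
    # Competition: the node whose weight vector is closest to x wins (the BMU).
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Cooperation: a Gaussian neighborhood around the BMU on the grid.
    grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma**2))
    # Adaptation: the BMU and its neighbors move toward x.
    weights[:] += lr * h[..., None] * (x - weights)

for t in range(2000):
    frac = t / 2000
    train_step(rng.random(dim), lr=0.5 * (1 - frac), sigma=max(0.5, 5 * (1 - frac)))
```

The shrinking learning rate and neighborhood radius are what let the map first unfold globally and then fine-tune locally - the self-organization the tutorial is named after.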
In the second part of the tutorial we will cover meta-clustering before moving on to our target: the analysis of DNA microarray data. Once we have done that, we will see how to turn our unsupervised clustering system into a supervised classification system, both in general and specifically in Synapse.
If you are interested in a very flexible rule-based system that can be easily integrated with, for instance, adaptive systems, then fuzzy logic provides a good solution.
So what is fuzzy logic?
Take a look at the first sentence of this text. It uses expressions like "very flexible", "easily integrated" and "good solution". Such vague expressions are characteristic of the way we humans communicate through language, and as such are an integral part of our thinking process. This contrasts sharply with the traditional Boolean logic of computer programming. Fuzzy logic bridges the gap between the two, providing a framework that lets you numerically encode linguistic expressions and thereby gives you a flexible rule-based system.
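To make the idea concrete before the theory: instead of a statement being simply true or false, a fuzzy set assigns it a degree of membership between 0 and 1. Here is a minimal sketch in Python; the temperature variable, the triangular shapes and the breakpoints are all made up for illustration.

```python
def triangular(x, a, b, c):
    """Triangular membership function: 0 at a, rising to 1 at b, back to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical linguistic terms for a temperature variable, in degrees C.
def cold(t): return triangular(t, -5, 5, 15)
def warm(t): return triangular(t, 10, 20, 30)
def hot(t):  return triangular(t, 25, 35, 45)

t = 18.0
print(f"cold: {cold(t):.2f}, warm: {warm(t):.2f}, hot: {hot(t):.2f}")

# A fuzzy rule like "IF warm AND NOT hot THEN ..." commonly uses min for AND
# and 1 - x for NOT, yielding a degree of firing rather than a true/false value.
firing = min(warm(t), 1 - hot(t))
print(f"rule firing strength: {firing:.2f}")
```

At 18 degrees the statement "it is warm" is 0.8 true and "it is cold" is 0.0 true - exactly the kind of graded, linguistic encoding that Boolean logic cannot express.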
This is the first part of a two-part tutorial and will cover the basic fuzzy logic theory. The second part will be more practical and deal with fuzzy logic in Synapse and the creation of hybrid fuzzy-adaptive systems.
In this article we will take a closer look at three different types of classifiers: naive Bayesian classifiers, support vector machines and modular multilayer perceptron neural networks.
To help us investigate the relative benefits of each system, we're going to use a simple application built around adaptive systems deployed from Synapse.
Besides the novelty filtering covered in Part 1, there is another interesting function that a Hebbian Layer can perform: Principal Component Analysis (PCA). This time we are going to take a closer look at PCA and see how it can be used in Synapse in combination with regular neural networks. PCA is a linear transformation that can be used to reduce, compress or simplify a data set. It does this by transforming the data to a new coordinate system such that the projection with the greatest variance ends up on the first component (coordinate), the projection with the second greatest variance on the second component, and so on. This way one can discard some of the components and still capture the most important part of the data.
To understand what this means, we can take a look at a 2D example. Suppose we have some X-Y data that looks something like this:
To see how the data is spread, we encapsulate the data set inside an ellipse and take a look at the major and minor axes, which form the vectors P1 and P2:
These are the principal component axes - the base vectors, ordered by the variance of the data. PCA finds these vectors for you and gives you an [X,Y] -> [P1, P2] transformation. While this example was in 2D, PCA works for N-dimensional data, and it is on high-dimensional problems that it is generally used.
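For reference, here is what batch PCA looks like when computed directly from the covariance matrix, a minimal Python/NumPy sketch on synthetic 2D data (the mixing matrix is arbitrary; the Hebbian approach discussed below arrives at the same components iteratively instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data: an elongated cloud like the one in the figure above.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])

# Center the data and diagonalize its covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The columns of eigvecs are P1 and P2; projecting onto them is the
# [X,Y] -> [P1, P2] transformation.
projected = Xc @ eigvecs
print("variance along P1 and P2:", eigvals)

# Keeping only P1 compresses the data to one dimension while
# retaining most of the variance.
compressed = Xc @ eigvecs[:, :1]
```

Note that the components are ordered by eigenvalue, which is exactly the variance captured along each axis - this is what justifies dropping the trailing components when compressing.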
Let's take a look at how it can be used in practice, its limitations and how it is done with Hebbian learning:
Among the most under-appreciated types of adaptive systems are those that use Hebbian learning. It is because of their simplicity that they get ignored, but as we shall show, they do have some practical applications for which they are really good. In Synapse, Hebbian learning is embodied by the Hebbian Layer component and the Hebbian and Oja's update rules.
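To anchor the discussion, the two update rules are simple enough to state in a few lines. Plain Hebbian learning, dw = eta * y * x, strengthens a weight whenever input and output fire together, but grows without bound; Oja's rule adds a normalizing decay term, dw = eta * y * (x - y * w), which keeps the weight vector at unit length and makes it converge to the first principal component of the input. A minimal Python/NumPy sketch (the data, learning rate and iteration count are arbitrary choices for illustration, not Synapse's internals):

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic correlated 2D data cloud, centered.
X = rng.normal(size=(2000, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
X -= X.mean(axis=0)

w = rng.normal(size=2)   # weights of a single linear neuron
eta = 0.001
for x in X:
    y = w @ x                     # neuron output
    w += eta * y * (x - y * w)    # Oja's rule: Hebbian term minus decay

# w now points along the first principal component, with unit length.
print("learned direction:", w / np.linalg.norm(w))
```

The decay term is the whole trick: without it the plain Hebbian update would blow w up, while with it the neuron settles on the direction of maximum variance - which is why a Hebbian Layer can perform PCA at all.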