Besides the novelty filtering covered in Part 1, there is another interesting function that a Hebbian Layer can perform: Principal Component Analysis (PCA). This time we are going to take a closer look at PCA and see how it can be used in Synapse in combination with regular neural networks. PCA is a linear transformation that can be used to reduce, compress or simplify a data set. It does this by transforming the data to a new coordinate system such that the projection with the greatest variance ends up on the first component (coordinate), the projection with the second greatest variance on the second component, and so on. This way one can discard the later components and still capture the most important part of the data.
To understand what this means, we can take a look at a 2D example. Suppose we have some X-Y data that looks something like this:
To see how the data is spread, we encapsulate the data set inside an ellipse and take a look at the major and minor axes, which form the vectors P1 and P2.
These are the principal component axes - the base vectors, ordered by the variance of the data. PCA finds these vectors for you and gives you an [X,Y] -> [P1, P2] transformation. While this example is 2D, PCA works for N-dimensional data, and it is with high-dimensional problems that it is generally used.
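To make the [X,Y] -> [P1, P2] transformation concrete, here is a minimal numpy sketch (not Synapse's implementation) that finds the principal axes of a 2D cloud by eigendecomposing its covariance matrix and projects the data onto them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data: an elongated "ellipse" cloud, like the illustration
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.2, size=500)
data = np.column_stack([x, y])

# Center the data and eigendecompose its covariance matrix
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))

# Sort components by descending variance: P1 first, then P2
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The [X,Y] -> [P1,P2] transformation is a projection onto the eigenvectors
projected = centered @ eigvecs

# Variance along P1 dominates variance along P2
print(projected.var(axis=0))
```

The columns of `eigvecs` are the P1 and P2 vectors from the ellipse picture; dropping the second column is exactly the "use fewer components" compression described above.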
Let's take a look at how it can be used in practice, its limitations and how it is done with Hebbian learning:
Important: If you wish to try the solutions included here in Synapse, you have to make sure that you have the latest version of the Hebbian components, as they have been updated recently, as well as once before. To make sure that the components are updated, you must start Synapse with automatic updates enabled (on by default, controlled in Tools->Options), have a working Internet connection and, if you are using a firewall, make sure that Synapse is allowed to connect to the update server.
Most problems with data intended for some form of statistical analysis can be reduced to two cases: too few samples or too many features (variables). PCA helps with the latter. Having too many features often gives the problem too many degrees of freedom, leading to poor statistical coverage and thus poor generalization. In addition, each feature adds to the computational burden in terms of processing and storage.
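The "too many features" case can be illustrated with a small numpy sketch (an illustrative toy, not taken from the article's data): ten measured features that are really driven by only two underlying factors, so two principal components capture nearly all of the variance:

```python
import numpy as np

rng = np.random.default_rng(4)

# 10 measured features that are really mixtures of 2 hidden factors
factors = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 10))
data = factors @ mixing + 0.01 * rng.normal(size=(1000, 10))
data -= data.mean(axis=0)

# Eigenvalues of the covariance matrix, largest first
eigvals = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]

# Cumulative fraction of variance explained by the first k components
explained = np.cumsum(eigvals) / eigvals.sum()
print(explained[1])  # two components already capture almost everything
```

Keeping two components instead of ten removes eight degrees of freedom while losing almost no information, which is exactly the trade PCA offers.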
Shouldn't a supervised neural network be able to do PCA on its own? Yes, it can, but there are three good reasons why we should avoid dumping that task on a neural net:
- Increased degrees of freedom (features) drastically increases the search space that the adaptation algorithm has to cover.
- Increased degrees of freedom increases the complexity of the search space. A more complex search space results in a larger number of local minima for the optimization to get stuck in and therefore to give suboptimal solutions.
- A neural net will, for most problems, do something similar to PCA anyway. Taking away that task allows the neural net to do the things that can't be done with PCA, so its computational power is put to better use.
Limitations of PCA
There are, however, some limitations of PCA that we should take into consideration. First of all, it is a linear method. Basically the problem involves rotating the ellipsoid we saw earlier so that the direction of the greatest variance of the data becomes the first component. Simplified, PCA does basically this:
Now this works fine as long as the X/Y relation is fairly linear. If we have a situation like this, we have a problem:
While PCA still tries to produce components by variance, it fails here, as the largest variance does not lie along a single vector but along a non-linear path. Neural networks, on the other hand, are perfectly capable of dealing with nonlinear problems and can perform this kind of transformation on their own. In addition, they can do scaling directly, so that the principal components are scaled by their importance (eigenvalues):
In this case [X,Y] -> [g(P1), f(P2)] is a nonlinear transformation. So while PCA in theory is an optimal linear feature extractor, it can perform badly on non-linear problems.
PCA using the Hebbian
PCA is widely used in statistics and there are traditional ways of calculating it, based on covariance matrices and singular value decomposition. The problem is that these are extraordinarily expensive in terms of processing power and memory, making them impractical for larger data sets. There is, however, an alternative: using a Hebbian Layer with the right choice of update rules.
Oja's rule and the Maximum Eigenfilter
Regular Hebbian learning usually has a problem - it diverges. Use it long enough and the weights will go towards negative or positive infinity. To solve this, Erkki Oja introduced what is now known as Oja's rule - a version that normalizes the weights so that they don't diverge.
The normalization of the weights has, however, another feature: it performs PCA, although limited to the first principal component. Training a Hebbian component that has one output feature with Oja's rule gives us a projection of the data onto the principal eigenvector of the autocorrelation matrix - in other words, a projection onto the first principal component (the major axis of the ellipse in the illustrations above). This is called a Maximum Eigenfilter, as it extracts the most information it can from an N-dimensional signal along a single component.
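For reference, here is a minimal numpy sketch of Oja's rule (the textbook form, not Synapse's internal code): the weight update is the plain Hebbian term y·x minus a normalizing term y²·w, and the weight vector converges to the principal eigenvector:

```python
import numpy as np

rng = np.random.default_rng(2)

# Zero-mean 2D data with most variance along roughly the (1, 1) direction
base = rng.normal(size=2000)
data = np.column_stack([base, base + 0.3 * rng.normal(size=2000)])
data -= data.mean(axis=0)

# Oja's rule: w += eta * y * (x - y * w), where y = w . x
w = rng.normal(size=2)
eta = 0.01
for _ in range(20):
    for x in data:
        y_out = w @ x
        w += eta * y_out * (x - y_out * w)

# Compare with the principal eigenvector of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
p1 = eigvecs[:, np.argmax(eigvals)]
print(abs(w @ p1))  # alignment close to 1 once converged
```

Note that without the `- y_out * w` term this is plain Hebbian learning and the weights blow up; the term is exactly the normalization described above.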
Generalized Hebbian Algorithm
Fortunately, there is another algorithm that will do a full PCA for us: the Generalized Hebbian Algorithm (GHA), also known as Sanger's rule. It is somewhat more expensive computationally, but it will provide us with the most significant principal components in an ordered fashion.
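A numpy sketch of Sanger's rule (again the textbook form, not Synapse's code) shows how it extends Oja's rule: each output deflates the input seen by the outputs after it, which is what orders the components:

```python
import numpy as np

rng = np.random.default_rng(3)

# 3D data with decreasing variance along the three coordinate axes
data = rng.normal(size=(3000, 3)) * np.array([3.0, 1.0, 0.3])
data -= data.mean(axis=0)

# Sanger's rule (GHA): dW = eta * (y x^T - tril(y y^T) W)
# The lower-triangular term makes component i subtract components 1..i
W = rng.normal(scale=0.1, size=(3, 3))  # rows are the learned components
eta = 0.005
for _ in range(15):
    for x in data:
        y = W @ x
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# Rows of W converge (up to sign) to the ordered principal eigenvectors
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
order = np.argsort(eigvals)[::-1]
for i in range(3):
    print(abs(W[i] @ eigvecs[:, order[i]]))
```

With only the diagonal of that triangular matrix the update reduces to independent copies of Oja's rule, which is why Oja alone cannot separate the components.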
To demonstrate the difference between Oja's rule and the GHA, take a look at this example:
Here we have a function source that outputs 20 identical sine functions, one per feature. The top Hebbian Layer is trained with the GHA and the bottom one with Oja's rule. Since the functions are identical, there should be just one non-zero principal component, and that is exactly what the GHA produces. Oja's rule, on the other hand, gives us an array of differently scaled versions of the first principal component.
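The claim that identical signals leave only one non-zero component can be checked directly. This numpy sketch builds 20 copies of the same sine wave and inspects the covariance eigenvalues that a full PCA (which the GHA converges to) would recover:

```python
import numpy as np

t = np.linspace(0, 4 * np.pi, 500)
signal = np.sin(t)

# 20 features that are all the same sine wave, as in the Synapse example
data = np.tile(signal[:, None], (1, 20))
data -= data.mean(axis=0)

# Covariance eigenvalues, largest first: the matrix has rank 1,
# so only the first principal component is non-zero
eigvals = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]
print(eigvals[0], eigvals[1])
```

All the variance piles onto a single component, so the GHA's output of one non-zero feature is the correct full-PCA answer, while Oja's rule has no mechanism to suppress its redundant outputs.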
Is it meaningful?
Ultimately what we are interested in is hooking up PCA to a regular supervised neural network. The big question is: Is there a point to doing that?
The argument against is that a neural network should be able to perform the same task - but better, because of its non-linear nature. So why bother with PCA as preprocessing? Well, first of all, while you may be feeding the net a sub-optimal input, you also reduce its complexity. Second, the neural net is bound to do something PCA-like anyway, so if you extract that part, more computing power is left for the rest of the net.
Finally, at least in Synapse, you don't have to make a choice between PCA or a supervised net, you can make a hybrid that handles both and picks the best combination of both.
Let's take a look at three different cases that we'll run on the Abalone data set (the same we used in the first tutorial):
This is a very common setup that you'll find in the literature. Here the Hebbian Layer takes the full input, performs PCA and sends the result to an MLP neural net. Its popularity is due to the fact that it is relatively simple to implement, so less capable software tends to use it. The problem with it, as we shall see, is that, well, it is not good. It will consistently perform worse than a simple MLP topology on almost any data.
The plain Multilayer Perceptron net differs from the topology above in that the Hebbian Layer is replaced by a Weight Layer, which is updated by a supervised gradient descent algorithm. This is a standard, plain neural network.
Fortunately, Synapse's component-based approach allows a third alternative, where the PCA is done in parallel to the MLP. The advantage of this is that each branch will specialize: the PCA branch will do its thing, and the other branch will do what the PCA can't handle. Should the PCA contribution be useless, the system will adapt to ignore it. If, on the other hand, it does something useful, it will be used, and the other branch will compensate for whatever shortcomings it might have.
Performance on the abalone data set
If we run the systems in parallel on the abalone data set, we get the following results, which are as expected:
The serial PCA performs significantly worse than the other two, while the parallel PCA does somewhat better than the plain MLP.
The Hebbian Layer coupled with the GHA update rule can perform Principal Component Analysis (PCA). PCA is a transformation that can be used to reduce, compress or simplify a data set. While a standard neural network such as the MLP can do the necessary projection itself, in some cases doing a PCA in parallel and weighting it in can give somewhat better results, as it simplifies the work for the rest of the system.
--Luka Crnkovic-Dodig / Peltarion