In this article we will take a closer look at three different types of classifiers: naive Bayes classifiers, support vector machines and modular multilayer perceptron neural networks.
To help us investigate their relative benefits, we're going to use a simple application built around adaptive systems deployed from Synapse.
What is a classifier?
A classifier is a system that performs a mapping from a feature space X to a set of labels Y. In other words, a classifier assigns a pre-defined class label to a sample. For example, if you are building a spam classifier, the feature space contains a representation of an email and the label is either "Spam" or "Non-Spam".
This should not be confused with clustering, where the algorithm partitions the data so that the samples in each cluster lie close together in feature space. Classification is a supervised method where you decide the classes, while clustering is an unsupervised method where the algorithm groups the data autonomously.
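To make the mapping from features to labels concrete, here is a minimal, purely illustrative sketch of a classifier in Python. The keyword list and threshold are invented for illustration; a real spam filter would learn its parameters from data rather than hard-code them:

```python
# A classifier maps a point in feature space X to a label in Y.
# Hypothetical example: a crude keyword rule for spam filtering.

def classify_email(features):
    """Map a feature dict describing an email to 'Spam' or 'Non-Spam'."""
    spam_words = {"winner", "free", "prize"}          # made-up keyword list
    hits = sum(1 for word in features["words"] if word in spam_words)
    return "Spam" if hits >= 2 else "Non-Spam"        # made-up threshold

print(classify_email({"words": ["you", "are", "a", "winner", "free", "gift"]}))  # → Spam
print(classify_email({"words": ["meeting", "notes", "attached"]}))               # → Non-Spam
```

The classifiers discussed below do exactly this kind of mapping, except that they learn the decision rule from labelled training samples instead of having it written by hand.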
A simple problem
To compare the different classifiers, we are going to take a look at a simple (or rather, limited) problem. While all of these classifiers can take an arbitrary number of features (variables) as input, and many can distinguish an arbitrary number of classes, we're going to study a problem that is easy to visualize. The classification problem we are going to study has two input variables (X, Y) and two classes (Red, Blue).
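As a rough sketch of the kind of data we'll feed the classifiers, here is one hypothetical way to generate such a two-variable, two-class pattern in Python - one Gaussian blob per class. The centres and spread are arbitrary illustrative choices, not values from the demo application:

```python
# Toy version of the demo problem: two input variables (x, y) and two
# classes (Red, Blue), each sampled from its own 2-D Gaussian blob.
import random

random.seed(0)

def make_class(cx, cy, n, label, spread=0.5):
    """Draw n points around centre (cx, cy), tagged with a class label."""
    return [((random.gauss(cx, spread), random.gauss(cy, spread)), label)
            for _ in range(n)]

data = make_class(-1.0, -1.0, 20, "Blue") + make_class(1.0, 1.0, 20, "Red")
print(len(data))  # → 40
```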
Naive Bayes Classifier
The Naive Bayes classifier (aka Idiot's Bayes) is a very popular algorithm due to its simplicity, its computational efficiency and its surprisingly good performance on real-world problems. For instance, many email clients, such as Mozilla Thunderbird and Microsoft Outlook, use naive Bayes classifiers for filtering out spam emails.
The "naive" attribute comes from the fact that the model assumes all features are fully independent, which in real problems they almost never are. In spite of this serious violation of the basic assumptions and the simplistic design of the classifier, it turns out to be very well suited to problems involving normal distributions, which are very common in the real world. In this example we will use hard margins for the classifier (i.e. the result is either Red or Blue, with no degree of membership).
+Fast to train, fast to evaluate
+Surprisingly good for real-world problems
-Not capable of solving more complex classification problems
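To make the idea concrete, here is a rough Python sketch of a Gaussian naive Bayes classifier for the two-variable Red/Blue problem. This is the generic textbook formulation, not Synapse's implementation: each feature is modelled as an independent 1-D Gaussian per class (the "naive" assumption), and the hard decision simply picks the class with the larger log-likelihood:

```python
# Sketch of a Gaussian naive Bayes classifier with a hard decision.
import math

def fit(samples):
    """samples: list of ((x, y), label) -> per-class log-prior and feature stats."""
    by_class = {}
    for features, label in samples:
        by_class.setdefault(label, []).append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        stats = []
        for d in range(len(rows[0])):
            col = [r[d] for r in rows]
            mean = sum(col) / n
            var = sum((v - mean) ** 2 for v in col) / n + 1e-9  # guard against zero variance
            stats.append((mean, var))
        model[label] = (math.log(n / len(samples)), stats)      # log prior, per-feature stats
    return model

def predict(model, features):
    """Hard decision: return the class with the highest log-posterior score."""
    best, best_score = None, -math.inf
    for label, (log_prior, stats) in model.items():
        score = log_prior
        for v, (mean, var) in zip(features, stats):
            # log of the 1-D Gaussian density, features assumed independent
            score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = label, score
    return best

train = [((0.0, 0.1), "Blue"), ((0.2, -0.1), "Blue"), ((-0.1, 0.0), "Blue"),
         ((2.0, 2.1), "Red"), ((1.8, 1.9), "Red"), ((2.2, 2.0), "Red")]
model = fit(train)
print(predict(model, (0.1, 0.0)), predict(model, (2.0, 2.0)))  # → Blue Red
```

Note that training amounts to computing per-class means and variances in a single pass over the data, which is why the Bayes classifier trains practically instantaneously.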
Support Vector Machine
Support Vector Machines, or Kernel Machines, are a relatively new family of algorithms. One could say that they are in many ways the exact opposite of the Bayes classifier: extremely powerful and complex, but cumbersome to train and slow to evaluate. They are based on the idea that data becomes easier to separate when its dimensionality is increased. In fact, the SVM works in an N-dimensional space, where N is the number of samples in the training set. This approach allows the SVM to classify problems of arbitrary complexity. The downside is that outliers can easily sabotage the classifier, and its generalization capabilities can be questionable. As with the Bayes classifier, we will use hard margins here (i.e. the result is either Red or Blue, with no degree of membership).
In addition to being slow to train, it is slow to use and has large memory requirements. It works well for small data sets (such as ours), but the computational requirements increase dramatically with the size of the training set. For data sets above 5000 samples it can be completely unusable, even on a PC with a strong CPU. There are various tricks (like support vector limitation) that can be used to increase the speed, but we won't cover them here.
In Synapse, the SVM is implemented by the SVM-KA component. It uses the Kernel Adatron SVM model, which is reasonably fast compared to other SVM models. Its disadvantage is that it is a binary classifier - it can only distinguish between two classes. This can be overcome by using several SVMs in combination with logical gates:
+Extremely powerful non-linear classifier
-Computationally demanding to train and to run
-Sensitive to noisy data
-Prone to overfitting and thus bad generalization
-Choice of kernel function and the parameters have to be set manually and can greatly impact the results
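For the curious, the Kernel Adatron model behind SVM-KA can be sketched in a few lines of Python. This follows the published Kernel Adatron algorithm with a plain Gaussian kernel, not necessarily Synapse's exact implementation (which uses the KMOD kernel). Note that it is inherently binary, with labels +1/-1:

```python
# Rough sketch of a Kernel Adatron SVM with a Gaussian kernel.
import math

def gaussian_kernel(a, b, sigma=1.0):
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def train_adatron(xs, ys, sigma=1.0, eta=0.5, epochs=200):
    """Learn one multiplier alpha_i >= 0 per training sample."""
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j], sigma) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            z = sum(alpha[j] * ys[j] * K[i][j] for j in range(n))
            # push each sample's margin y_i * z towards 1, keeping alpha >= 0
            alpha[i] = max(0.0, alpha[i] + eta * (1.0 - ys[i] * z))
    return alpha

def predict(xs, ys, alpha, x, sigma=1.0):
    """Sign of the kernel expansion decides the class - a binary decision."""
    s = sum(alpha[j] * ys[j] * gaussian_kernel(xs[j], x, sigma) for j in range(len(xs)))
    return "Red" if s >= 0 else "Blue"

# XOR-like toy data: no single straight line can separate the classes.
xs = [(0, 0), (1, 1), (0, 1), (1, 0)]
ys = [+1, +1, -1, -1]                  # +1 = Red, -1 = Blue
alpha = train_adatron(xs, ys)
print([predict(xs, ys, alpha, x) for x in xs])
```

Every prediction sums over the whole training set, which illustrates why SVM evaluation slows down and memory use grows as the training set gets larger.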
Modular Neural Network
A modular MLP neural network is a traditional neural network that has two or more separate branches. The branches allow for specialization and competition between local parts of the system. A neural network is slow to train but very fast to use. To perform the classification, it uses a gradient-based optimization algorithm (Quickprop in this case) to adjust the parameters (weights, biases) of its components. The neural net combines an increase in the dimensionality of the problem (like the SVM) with a recursive superposition of basis functions.
The result is that it is usually good with continuous regions that can be represented simply by the basis functions, while it is not as good with regions made up of disconnected segments and complex boundaries. Thanks to this, it has excellent generalization capabilities and a built-in robustness to noisy data. You can, however, force it to fit more complex patterns by training it excessively, at the cost of its generalization capabilities.
In this case we'll be using a System deployed from Synapse that looks like this:
+Powerful non-linear classifier
+Elegant solutions built of continuous basis-functions
+Handles noisy data well
+Fast to run
-Computationally demanding to train
-Has difficulty with more complex boundaries
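As an illustration of the branch idea, here is a hedged Python/NumPy sketch of a tiny two-branch MLP trained on an XOR-style pattern. The system deployed from Synapse is trained with Quickprop; for simplicity this sketch uses plain gradient descent, and the layer sizes and learning rate are arbitrary illustrative choices:

```python
# Minimal "modular" MLP: two separate tanh branches whose outputs are
# merged by a single sigmoid output unit, trained with plain gradient
# descent (the article's system uses Quickprop instead).
import numpy as np

rng = np.random.default_rng(0)

# XOR-style data: 2-D inputs, target 0 = Blue, 1 = Red.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])

W1a = rng.normal(0, 1, (2, 3))   # branch a: 2 inputs -> 3 hidden units
W1b = rng.normal(0, 1, (2, 3))   # branch b: 2 inputs -> 3 hidden units
W2 = rng.normal(0, 1, (6, 1))    # merge layer: 6 hidden -> 1 output

def forward(X):
    ha, hb = np.tanh(X @ W1a), np.tanh(X @ W1b)  # the two branches
    h = np.concatenate([ha, hb], axis=1)         # merge branch outputs
    y = 1 / (1 + np.exp(-(h @ W2)))              # sigmoid output unit
    return ha, hb, h, y

def loss(y):
    return float(np.mean((y - t) ** 2))

start = loss(forward(X)[3])
lr = 0.1
for _ in range(5000):
    ha, hb, h, y = forward(X)
    grad_out = (y - t) * y * (1 - y)             # error signal at the output
    grad_h = grad_out @ W2.T                     # error split back to both branches
    W2 -= lr * (h.T @ grad_out)
    W1a -= lr * (X.T @ (grad_h[:, :3] * (1 - ha ** 2)))  # backprop into branch a
    W1b -= lr * (X.T @ (grad_h[:, 3:] * (1 - hb ** 2)))  # backprop into branch b

final = loss(forward(X)[3])
print(start, final)  # the training error should drop substantially
```

Re-running with a different random seed typically produces a different but equally valid set of weights, since gradient-based training only finds a local minimum.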
To demonstrate these classifiers, I've created a simple Windows application where you can try out the algorithms for yourself. At its core is a system deployed from Synapse that contains one Bayes classifier, one SVM and one modular neural network.
Draw Area. You input Blue class values by left clicking and Red class values by right clicking.
Clears the draw area.
Saves the pattern in the Draw Area and the settings to file.
Loads a pattern from file.
Sets the number of training epochs for the Bayes classifier. As it is trained instantaneously, there is no need to increase this value.
Number of SVM training epochs.
Number of training epochs for the MLP.
The sigma parameter for the SVM's kernel. Small sigmas give tighter, more exact solutions; large sigmas give more general solutions.
Bar that shows the training progress.
The results of the Bayes classification.
The results of the SVM classification.
The results of the MLP classification.
Starts the Bayes training and evaluation.
Starts the SVM training and evaluation.
Starts the MLP training and evaluation.
Now on to the fun part. All the patterns used are in the zip file in the "patterns" directory.
Let's start with the simplest situation, where we have two classes that are each confined to their own continuous region in the X-Y plane. (simple.ptn)
As we can see all classifiers handle that without any problems. In this case the Bayes classifier is the clear winner due to its speed and simplicity.
If we take a look at another simple pattern (simple2.ptn) that contains three distinct regions:
We can see here that while all three systems correctly classify the training data, their solutions are very different. The Bayes classifier decides that everything belongs to the Blue class, while the SVM does the exact opposite. The MLP comes up with the most reasonable and simplest solution and just partitions the plane into three regions.
OK, how about a slightly more complicated problem - the hello world of machine learning, the XOR. (xor.ptn)
Here the Bayes classifier clearly fails, as it is simply incapable of classifying this type of topology. The SVM handles it relatively well. The MLP neural net solves the problem too. If you try re-training it a couple of times, you'll notice something interesting - it will come up with different solutions. This is because the system is over-dimensioned for such a simple problem - it has far more free parameters than needed. This means there are many local minima to get stuck in, and hence many possible solutions.
A similar problem, the checkers pattern (checkers.ptn) allows us to study the difference between the SVM and MLP solutions:
The Bayes classifier fails here too. What is interesting, however, is the qualitative difference between the SVM solution and the MLP neural net solution. The SVM builds class boundaries by separating the classes with a non-linear hyperplane in a high-dimensional space. Its solutions are therefore sound from the topological perspective of the training set. In short, it can classify the training data set very well. The problem is that it can do so a bit too well - to the degree that it misses data it hasn't seen, and it is sensitive to outliers. The MLP neural net, on the other hand, treats classification as a function approximation problem - it builds a model from the superposition of non-linear basis functions. This results in much smoother boundaries and, as a consequence, less sensitivity to noise in the data.
To give an example, here is a variation of the XOR problem that contains outlier points which are really just noise, not part of the main classification problem (xor_outlier.ptn):
As you can see, the SVM has a small red region in the blue area and a small blue region in the red area, as it blindly believed the training set. The MLP, on the other hand, simply ignored the outliers and focused on the large data regions.
This however also means that the SVM is capable of classifying more complex regions. If we take a look at this complex pattern (complex.ptn):
The SVM manages to classify everything correctly, while the MLP neural net fails in certain parts. There is, however, one catch with the SVM - the choice of kernel function and its settings. In this case we are using a kernel called KMOD, which is a variation of the Gaussian kernel. It has one very relevant parameter - sigma, which determines the resolution of the region.
If you set sigma to a low value, you get good accuracy but no generalization. If you set it to a high value, you get good generalization but poor accuracy. In this demo problem that is simple - you do some trial and error. In more realistic problems, where you have many input variables, this can be difficult.
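The sigma trade-off is easy to see numerically. Here is a small sketch using a plain Gaussian (RBF) kernel - KMOD itself is a variation of this, so the exact numbers differ, but the behaviour is the same:

```python
# How sigma controls the "resolution" of a Gaussian-style kernel.
import math

def rbf(a, b, sigma):
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

p, q = (0.0, 0.0), (3.0, 0.0)   # two points 3 units apart

# Small sigma: the kernel only "sees" a point's immediate neighbourhood,
# so the decision regions fit tightly around the training data.
print(round(rbf(p, q, sigma=1.0), 4))   # → 0.0111
# Large sigma: even distant points still look similar, which gives
# broad, general regions at the cost of accuracy.
print(round(rbf(p, q, sigma=10.0), 4))  # → 0.956
```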
Let's look at that closer through the dots pattern (dots.ptn)
As you can see, sigma is set to 3, which gives a very tight fit around the data points. The MLP neural net, on the other hand, generalizes well (and actually draws the regions exactly as I had in mind when I drew the points). Suppose we now increase sigma by a factor of ten, to 30:
As you can see, the solution is more general but far less accurate. You can, to a smaller extent, do the same thing with the MLP network through the number of epochs you train it for. Basically, the more you train it, the better it will learn the training set, increasing its resolution at the cost of generalization.
Finally, play around with the program yourself to get a more intuitive understanding of how the different classifiers behave.
And in Synapse, remember that there's nothing preventing you from combining all the classifiers into one system, for instance something like this:
The Naive Bayes classifier is simple, fast and of limited capability when it comes to anything but the most trivial cases of classification. Fortunately many real-world problems fall in exactly that category, so you shouldn't rule it out.
The Support Vector Machine is complex, slow and takes a lot of memory, but it is an immensely powerful classifier. The SVM-KA also has the limitation of being a binary classifier. SVMs are generally unsuitable for larger training sets, as their execution speed decreases with the size of the training set. However, their superiority in classifying complex patterns should not be underestimated. (Perhaps an SVM-based spam filter could make an interesting tutorial.)
The MLP neural network is good at capturing fairly complex patterns while keeping good generalization capabilities. While it is slow to train, it is fast to use and its execution speed is independent of the size of the data it was trained on. It is very well suited for more complex real-world problems and thus on average superior to both the SVM and the Bayes classifier.
Ultimately, however, it all depends on your problem - and with Synapse, you can easily combine the three classifiers into one system.
Luka Crnkovic-Dodig / Peltarion