Section 12.1 On Vectors
From the previous chapter you may remember that data mining techniques fall into two large categories: supervised learning techniques and unsupervised learning techniques. The association rules mining examined in the previous chapter was an unsupervised technique. This means that there was no particular criterion that we were trying to predict; rather, we were simply looking for patterns that emerged naturally from the data.
In the present chapter we will examine a supervised learning technique called "support vector machines." We will see shortly why the technique has this name. The reason this is considered a supervised learning technique is that we train the algorithm on an initial set of data (the "supervised" phase) and then test it out on a brand new set of data. If the training we accomplished worked well, then the algorithm should be able to predict the right outcome most of the time in the test data.
Take the weather as a simple example. Some days are cloudy, some are sunny. The barometer rises on some days and falls on others. The wind may be strong or weak, and it may come from various directions. If we collect data on a number of days and use those data to train a machine learning algorithm, the algorithm may find that cloudy days with a falling barometer and wind from the east signal that rain is likely. Next, we can collect more data on some other days and see how well our algorithm does at predicting rain on those days. The algorithm will make mistakes. The percentage of mistakes is the error rate, and we would like the error rate to be as low as possible.
This is the basic strategy of supervised machine learning: train on a substantial number of cases so that the algorithm can discover and mimic the underlying pattern, and then apply the results of that process to a test data set in order to find out how well the algorithm and its parameters perform, a check known as "cross validation." Cross validation, in this instance, refers to the process of verifying that the trained algorithm can carry out its prediction or classification task accurately on novel data.
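To make this strategy concrete, here is a minimal sketch in Python (assuming the NumPy and scikit-learn libraries; the weather-style data, the column meanings, and the rain rule are all invented placeholders for illustration):

```python
# A minimal sketch of the train-then-test strategy, assuming Python with
# NumPy and scikit-learn. The "weather" data below are randomly generated
# placeholders, and the rain rule is invented purely for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))      # columns: cloud cover, barometer change, wind
y = (X[:, 1] < 0).astype(int)      # pretend rule: falling barometer -> rain

# Hold out 25% of the cases as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = SVC()                      # the "supervised" (training) phase
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)   # proportion correct on novel data
print(f"error rate: {1 - accuracy:.1%}")
```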
In this chapter, we will develop a "support vector machine" (SVM) to classify emails into spam or not spam. An SVM maps a low-dimensional problem into a higher-dimensional space with the goal of being able to describe geometric boundaries between different regions. The input data (the independent variables) from a given case are processed through a "mapping" algorithm called a kernel (the kernel is simply a formula that is run on each case's vector of input data), and the resulting kernel output determines the position of that case in multidimensional space.
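As a rough illustration of what such a kernel formula looks like, the sketch below computes the radial basis function (RBF) kernel, one common choice (the function name and the example vectors are our own, not from the chapter):

```python
# A sketch of one common kernel formula, the radial basis function (RBF):
# K(a, b) = exp(-gamma * ||a - b||^2). Cases that are close together in
# the input space get a kernel value near 1; distant cases get values
# near 0.
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel([1.0, 2.0], [1.2, 1.9]))   # similar cases -> close to 1
print(rbf_kernel([1.0, 2.0], [9.0, -4.0]))  # dissimilar cases -> near 0
```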
A simple 2D-to-3D mapping example illustrates how this works: Imagine a photograph of a snow-capped mountain taken from high above the earth, such that the mountain looks like a small, white circle completely surrounded by a region of green trees. Using a pair of scissors, there is no way of cutting the photo on a straight line so that all of the white snow is on one side of the cut and all of the green trees are on the other. In other words, there is no simple linear separation function that could correctly separate or classify the white and green points given their 2D positions on the photograph.
Next, instead of a piece of paper, think about a realistic three-dimensional clay model of the mountain. Now all the white points occupy a cone at the peak of the mountain and all of the green points lie at the base. Imagine inserting a sheet of cardboard through the clay model in a way that divides the snow-capped peak from the green-tree-covered base. It is much easier to do now, because the white points stick up at high altitude while the green points all lie at the base of the mountain.
The position of that piece of cardboard is the planar separation function that divides white points from green points. A support vector machine analysis of this scenario would take the original two-dimensional point data and search for a projection into three dimensions that would maximize the spacing between green points and white points. The result of the analysis would be a mathematical description of the position and orientation of the cardboard plane. Given inputs describing a novel data point, the SVM could then map the data into the higher-dimensional space and report whether the point lies above the cardboard (a white point) or below it (a green point). The so-called support vectors are the training cases that lie closest to the separating boundary; their coefficients determine the position and orientation of that boundary in the high-dimensional space.
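The mountain analogy can be reproduced in a few lines of code. The sketch below (again assuming NumPy and scikit-learn; the lifting z = x1^2 + x2^2 is one hand-picked projection, not necessarily the one an SVM would find) shows that a straight-line boundary fails on the flat "photograph" but a plane succeeds on the "clay model":

```python
# A sketch of the 2D -> 3D lifting idea from the mountain analogy.
# Points inside a circle ("snow") cannot be cut from the surrounding
# points ("trees") with a straight line in 2D, but after adding a third
# coordinate z = x1^2 + x2^2 a flat plane separates them.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
points = rng.uniform(-2, 2, size=(300, 2))
radius_sq = points[:, 0] ** 2 + points[:, 1] ** 2
is_snow = (radius_sq < 1.0).astype(int)   # inner circle = snow, rest = trees

# A straight-line boundary in the original 2D "photograph" fails...
flat = SVC(kernel="linear", C=1e6).fit(points, is_snow)
print(flat.score(points, is_snow))        # stuck near the majority-class rate

# ...but in the lifted 3D "clay model" a plane separates the classes
lifted = np.column_stack([points, radius_sq])
clay = SVC(kernel="linear", C=1e6).fit(lifted, is_snow)
print(clay.score(lifted, is_snow))        # perfect, or very nearly so
```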