Section 12.4 Testing the Model
The real acid test for our support vector model, however, will be to use the support vectors we generated through this training process to predict the outcomes in a novel data set. Fortunately, because we prepared carefully, we have the testData holdout set ready to go. The following commands will give us some output known as a "confusion matrix:"
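Here is a sketch of those commands. It assumes that svmOutput is the model object we fitted in the previous section using the kernlab package (the "votes" prediction type used below is a kernlab feature) and that testData is the 1534-case holdout set we created earlier, with the type variable in column 58:

    # Ask the trained model to vote on each case in the holdout data
    svmPred <- predict(svmOutput, testData, type = "votes")

    # Pair the human-coded type variable (column 58) with the non-spam votes
    compTable <- data.frame(testData[ , 58], svmPred[1, ])

    # Cross-tabulate ground truth against predictions: the confusion matrix
    table(compTable)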
The first command in the block above uses our model output from before, namely svmOutput, as the basis for prediction. It uses the testData, which the support vectors have never seen before, to generate predictions, and it requests "votes" from the prediction process. We could also look at probabilities and other types of model output, but for a simple analysis of whether the SVM is generating good predictions, votes will make our lives easier.
The output from the predict() command is a two-dimensional structure: a matrix of votes. You should use the str() command to examine its structure. Basically, there is one row of "vote" values for each of the two classes. Both rows are 1534 elements long, corresponding to the 1534 cases in our testData object. The first row has a one for a non-spam vote and a zero for a spam vote. Because this is a two-class problem, the other row is just the opposite. We can use either one because they are mirror images of each other.
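For instance, assuming the svmPred object created in the block above, the inspection might look like this:

    # Examine the dimensions and contents of the votes output
    str(svmPred)

    # Peek at the first few votes in the non-spam row: ones and zeros
    head(svmPred[1, ])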
In the second command above, we make a little dataframe, called compTable, with two variables in it: The first variable is the 58th column in the test data, which is the last column, containing the "type" variable (a factor indicating spam or non-spam). Remember that this type variable holds the human judgments from the original dataset, so it is our ground truth. The second variable is the first row of votes in our svmPred structure, so it contains ones for non-spam predictions and zeros for spam predictions.
Finally, applying the table() command to our new dataframe (compTable) gives us the confusion matrix as output. Along the main diagonal we see the erroneous classifications: 38 cases that were not spam, but were classified as spam by the support vector machine, and 68 cases that were spam, but were classified as non-spam by the support vector machine. On the counter-diagonal, we see 854 cases that were correctly classified as non-spam and 574 cases that were correctly classified as spam.
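If you want to pull those counts out programmatically, a sketch like the following works, assuming the layout just described (rows are the human-coded type with non-spam first; columns are the 0/1 votes with 0 first):

    # Store the confusion matrix so we can index its cells
    confMat <- table(compTable)

    confMat[1, 1]   # non-spam misclassified as spam (38)
    confMat[2, 2]   # spam misclassified as non-spam (68)
    confMat[1, 2]   # non-spam correctly classified (854)
    confMat[2, 1]   # spam correctly classified (574)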
Overall, it looks like we did a pretty good job. There are a bunch of different ways of calculating the accuracy of the prediction, depending upon what you are most interested in. The simplest way is to sum the 68 + 38 = 106 error cases and divide by the 1534 total cases for a total error rate of about 6.9%. Interestingly, that is a tad better than the 8.5% error rate we got from the k-fold cross-validation in the run of svm() that created the model we are testing. Keep in mind, though, that we may be more interested in certain kinds of error than other kinds. For example, consider which is worse: an email that gets mistakenly quarantined even though it is not really spam, or a spam email that gets through to someone's inbox? It really depends on the situation, but you can see that you might want to give more consideration to either the 68 misclassification errors or the other set of 38 misclassification errors.
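The arithmetic for that overall error rate is simple enough to do right in the console; the second calculation below gets the same answer from the confusion matrix we stored above:

    # 106 errors out of 1534 test cases
    (68 + 38) / 1534   # roughly 0.069, or about a 6.9% error rate

    # Equivalent calculation from the stored confusion matrix
    (confMat[1, 1] + confMat[2, 2]) / sum(confMat)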