
Section 12.2 Preparing the Data

To get started with support vector machines, we can install and load one of the R packages that support this technique. We will use the "kernlab" package. Use the commands below:
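    install.packages("kernlab")   # download and install the package from CRAN
    library(kernlab)              # load the package into the current session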
I found that it was important to use the double quotes in the first command, but not in the second command. The data set that we want to use is built into this package. The data comes from a study of spam emails received by employees at the Hewlett-Packard company. Load the data with the following command:
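    data(spam)   # attach the spam dataset that ships with kernlab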
This command does not produce any output. We can now inspect the "spam" dataset with the str() command:
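    str(spam)
    # 'data.frame': 4601 obs. of 58 variables:
    #  $ make        : num  ...
    #   ...
    #  $ capitalTotal: num  ...
    #  $ type        : Factor w/ 2 levels "nonspam","spam": ...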
Some of the lines of output have been elided from the material above. You can also use the dim() function to get a quick overview of the data structure:
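    dim(spam)
    # [1] 4601   58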
The dim() function shows the "dimensions" of the data structure. Its output shows that the spam data structure has 4601 rows and 58 columns. If you inspect a few of the column names that emerged from the str() command, you may see that each email is coded with respect to its contents. There is also a lot of information available about the data set in its documentation:
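    ?spam   # open the kernlab help page for the spam dataset (equivalently, help(spam))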
For example, just before "type" at the end of the str() output above, we see a variable called "capitalTotal." This is the total number of capital letters in the whole email. Right after that is the criterion variable, "type," that indicates whether an email was classified as spam by human experts. Let’s explore this variable a bit more:
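    table(spam$type)
    #
    # nonspam    spam
    #    2788    1813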
We use the table() function because type is a factor rather than a numeric variable. The output shows us that there are 2788 messages that were classified by human experts as not spam, and 1813 messages that were classified as spam. What a great dataset!
To make the analysis work we need to divide the dataset into a training set and a test set. There is no universal way to do this, but as a rule of thumb, you can use two thirds of the data set to train and the remainder to test. Let’s first generate a randomized index that will let us choose cases for our training and test sets. In the following command, we create a new vector variable that contains a random permutation of the numbers ranging from 1 to the final element index of the spam data (4601).
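Here is a sketch of that command, along with the summary() and length() checks discussed next (the variable name randIndex is just a convenient choice):

    randIndex <- sample(1:dim(spam)[1])   # a random permutation of 1 through 4601
    summary(randIndex)                    # min should be 1 and max should be 4601
    length(randIndex)                     # should match the number of rows: 4601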
The output of the summary() and length() commands above shows that we have successfully created a list of indices ranging from 1 to 4601 and that the total length of our index list is the same as the number of rows in the spam dataset: 4601. We can confirm that the indices are randomized by looking at the first few cases:
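    head(randIndex)   # the first six indices, in scrambled order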
It is important to randomize your selection of cases for the training and test sets in order to ensure that there is no systematic bias in the selection of cases. We have no way of knowing how the original dataset was sorted (if at all); in case it was sorted on some variable of interest, we do not want to simply take the first two thirds of the cases as the training set.
Next, let’s calculate the "cut point" that would divide the spam dataset into a two thirds training set and a one third test set:
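    # A sketch; the variable name cutPoint2_3 is just a convenient choice
    cutPoint2_3 <- floor(2 * dim(spam)[1] / 3)   # two thirds of the row count, rounded down
    cutPoint2_3                                  # display the result
    # [1] 3067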
The first command in this group calculates the two-thirds cut point based on the number of rows in spam (the expression dim(spam)[1] gives the number of rows in the spam dataset). The second command reveals that the cut point is 3067 rows into the data set, which seems very sensible given that there are 4601 rows in total. Note that the floor() function chops off any decimal part of the calculation. We want to get rid of any decimal because an index variable needs to be an integer.