
Section 11.1 What is Data Mining?

Data mining is an area of research and practice focused on discovering novel patterns in data. As usual, R has lots of possibilities for data mining. In this chapter we will begin experimenting with essential data mining techniques by trying out one of the easiest methods to understand: association rules mining. More beer and diapers, please!
Data mining is a term that refers to the use of algorithms and computers to discover novel and interesting patterns within data. One famous example that gets mentioned quite frequently is the supermarket that analyzed patterns of purchasing behavior and found that diapers and beer were often purchased together. The supermarket manager decided to put a beer display close to the diaper aisle and supposedly sold more of both products as a result. Another familiar example comes from online merchant sites that say things like, "people who bought that book were also interested in this book." By using an algorithm to look at purchasing patterns, vendors are able to create automatic systems that make these kinds of recommendations.
Over recent decades, statisticians and computer scientists have developed many different algorithms that can search for patterns in different kinds of data. As computers get faster and researchers make these algorithms more efficient, it becomes possible to search larger and larger data sets for promising patterns. Today we have software that can search through massive data haystacks looking for lots of interesting and usable needles.
Some people refer to this area of research as machine learning. Machine learning focuses on creating computer algorithms that can use pre-existing inputs to refine and improve their own capabilities for dealing with future inputs. Machine learning is very different from human learning. When we think of human learning, like learning the alphabet or learning a foreign language, humans can develop flexible and adaptable skills and knowledge that are applicable to a range of different contexts and problems. Machine learning is more about figuring out patterns of incoming information that correspond to a specific result. For example, given lots of examples like "input: 3, 5, 10; output: 150", a machine learning algorithm could figure out on its own that multiplying the input values together produces the output value.
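The multiplication example can be sketched in a few lines of Python. This is only a toy illustration, not how real machine learning algorithms work: it assumes a tiny, hand-picked set of candidate rules and simply keeps the ones that reproduce every example. The first example comes from the text; the other examples and all of the names (`examples`, `candidates`, `learn_rule`) are made up for illustration.

```python
from math import prod

# Training examples as (inputs, output) pairs. The first pair is from the
# text; the other two are hypothetical additions.
examples = [
    ((3, 5, 10), 150),
    ((2, 4, 6), 48),
    ((1, 7, 9), 63),
]

# A deliberately tiny set of candidate rules the "learner" may choose from.
candidates = {
    "sum": sum,
    "product": prod,
    "maximum": max,
}

def learn_rule(examples, candidates):
    """Return the names of the candidate rules that reproduce every example."""
    return [
        name
        for name, rule in candidates.items()
        if all(rule(inputs) == output for inputs, output in examples)
    ]

consistent = learn_rule(examples, candidates)
print(consistent)
```

Only the "product" rule agrees with all of the examples, so it is the one the sketch keeps. Once a rule has been selected, it can be applied to inputs the program has never seen before, which is the sense in which past inputs improve the handling of future ones.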
Machine learning is not exactly the same thing as data mining and vice versa. Not all data mining techniques rely on what researchers would consider machine learning. Likewise, machine learning is used in areas like robotics that we don’t commonly think of when we are thinking of data mining as such.
Data mining typically consists of four processes: 1) data preparation, 2) exploratory data analysis, 3) model development, and 4) interpretation of results. Although this sounds like a neat, linear set of steps, there is often a lot of back and forth through these processes, especially among the first three. The other interesting point about these four steps is that Steps 3 and 4 seem like the most fun, but Step 1 usually takes the most time. Step 1 involves making sure that the data are organized in the right way, that missing data fields are filled in, that inaccurate data are located and repaired or deleted, and that data are "recoded" as necessary to make them amenable to the kind of analysis we have in mind.
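To make Step 1 concrete, here is a minimal Python sketch of the three preparation tasks just described: deleting inaccurate records, filling in missing fields, and recoding a variable. The records, the field names, and the specific choices (treating a negative income as inaccurate, filling a missing age with the mean) are all assumptions made for illustration; real data sets call for their own rules.

```python
# Hypothetical raw records; None marks a missing field.
raw = [
    {"age": 34, "income": 52000, "gender": "F"},
    {"age": None, "income": 48000, "gender": "M"},
    {"age": 29, "income": -1, "gender": "f"},
    {"age": 41, "income": 61000, "gender": "M"},
]

def prepare(records):
    """Delete inaccurate rows, fill in missing ages, and recode gender."""
    # Delete: a negative income is treated as an inaccurate record (assumption).
    valid = [r for r in records if r["income"] is not None and r["income"] >= 0]

    # Fill in: replace a missing age with the mean of the known ages.
    known_ages = [r["age"] for r in valid if r["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)

    cleaned = []
    for r in valid:
        cleaned.append({
            "age": r["age"] if r["age"] is not None else mean_age,
            "income": r["income"],
            # Recode: turn the text label into a 0/1 indicator.
            "is_female": 1 if r["gender"].upper() == "F" else 0,
        })
    return cleaned

cleaned = prepare(raw)
for row in cleaned:
    print(row)
```

Even in this tiny case, every line of the function encodes a judgment call about the data, which is part of why Step 1 tends to take so long in practice.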
Step 2 is very similar to activities we have done in prior chapters of this book: getting to know the data using histograms and other visualization tools, and looking for preliminary hints that will guide our model choice. The exploration process also involves figuring out the right values for key parameters. We will see some of that activity in this chapter.
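As a small illustration of Step 2, the Python sketch below builds a text histogram and shows how one parameter, the bin width, changes the picture. The data values and the `histogram` helper are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical measurements to explore.
values = [2, 3, 3, 4, 4, 4, 5, 5, 7, 9]

def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

hist = histogram(values, bin_width=2)
for start, count in hist.items():
    # One '#' per observation in the bin.
    print(f"{start:>2}-{start + 2 - 1:<2} {'#' * count}")
```

Calling `histogram(values, bin_width=5)` on the same data lumps everything into two bins and hides the long right tail, a reminder that exploring the data includes trying more than one parameter value.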
Step 3, choosing and developing a model, is by far the most complex and most interesting of a data miner's activities. This is where you test out a selection of the most appropriate data mining techniques. Depending upon the structure of a dataset, there may be dozens of options, and choosing the most promising one has as much art in it as science.
For the current chapter we are going to focus on just one data mining technique, albeit one that is quite powerful and applicable to a range of very practical problems. So we will not really have to do Step 3, because we will not have two or more different mining techniques to compare. The technique we will use in this chapter is called "association rules mining" and it is the strategy that was used to find the diapers and beer association described earlier.
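Before turning to the R tools, it may help to see the basic arithmetic behind an association rule. The Python sketch below computes three measures commonly used to evaluate a rule such as "diapers => beer": support (how often the items appear together), confidence (how often the right-hand side appears when the left-hand side does), and lift (confidence relative to how common the right-hand side is overall). The shopping baskets are invented for illustration, and the helper names are hypothetical; this is not the algorithm a real mining package uses to search for rules, only the scoring of a single rule.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def rule_metrics(lhs, rhs, transactions):
    """Support, confidence, and lift for the rule lhs => rhs."""
    both = support(set(lhs) | set(rhs), transactions)
    confidence = both / support(lhs, transactions)
    lift = confidence / support(rhs, transactions)
    return {"support": both, "confidence": confidence, "lift": lift}

metrics = rule_metrics({"diapers"}, {"beer"}, transactions)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A lift above 1 suggests that the two items show up together more often than we would expect if they were purchased independently, which is the kind of signal that would make a rule worth a closer look.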
Step 4, the interpretation of results, focuses on making sense of what the data mining algorithm has produced. This is the most important step from the perspective of the data user, because this is where an actionable conclusion is formed. When we discussed the example of beer and diapers, the interpretation of the association rules that were derived from the grocery purchasing data is what led to the discovery of the beer-diapers rule and the use of that rule in reconfiguring the displays in the store.