Section 8.1 Big Data? Big Deal!
The technology press contains many headlines about big data. What makes data big, and why is this bigness important? In this chapter, we discuss some of the real issues behind these questions. Armed with information from the previous chapter concerning sampling, we can give more thought to how the size of a data set affects what we do with the data.
MarketWatch (a Wall Street Journal Service) published an article with the title, "Big Data Equals Big Business Opportunity Say Global IT and Business Professionals," and the subtitle, "70 Percent of Organizations Now Considering, Planning or Running Big Data Projects According to New Global Survey." The technology news has been full of similar articles for several years. Given the number of such articles it is hard to resist the idea that "big data" represents some kind of revolution that has turned the whole world of information and technology topsy-turvy. But is this really true? Does "big data" change everything?
Business analyst Doug Laney suggested that three characteristics make "big data" different from what came before: volume, velocity, and variety. Volume refers to the sheer amount of data. Velocity focuses on how quickly data arrive as well as how quickly those data become "stale." Finally, variety reflects the fact that there may be many different kinds of data. Together, these three characteristics are often referred to as the "three Vs" model of big data. Note, however, that even before the dawn of the computer age we had a variety of data, some of which arrived quite quickly, and that could add up to quite a lot of total storage over time (think, for example, of the large variety and volume of data that have arrived annually at the Library of Congress since the 1800s!). So it is difficult to tell, just from someone saying that they have a high-volume, high-velocity, and high-variety data problem, whether big data is fundamentally a brand new thing.
With that said, there are certainly many changes afoot that make data problems qualitatively different today compared with a few years ago. Let's list a few claims about what has changed that are reasonably accurate:
- The decline in the price of sensors (like barcode readers) and other technology over recent decades has made it cheaper and easier to collect a lot more data.
- Similarly, the declining cost of storage has made it practical to keep lots of data hanging around, regardless of its quality or usefulness.
- Many people's attitudes about privacy seem to have accommodated the use of Facebook and other platforms where we reveal lots of information about ourselves.
- Researchers have made significant advances in the "machine learning" algorithms that form the basis of many data mining techniques.
- When a data set gets to a certain size (into the range of thousands of rows), conventional tests of statistical significance are meaningless, because even the tiniest and most trivial results (or effect sizes, as statisticians call them) turn out to be statistically significant (see the short sketch after this list).
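To see why, here is a minimal sketch in Python using NumPy and SciPy. The numbers are illustrative assumptions, not data from this book: two simulated groups differ by only half a point on a scale whose standard deviation is 15.

```python
# Illustration: as the number of rows grows, a trivially small difference
# between two groups becomes "statistically significant" anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

for n in (100, 10_000, 1_000_000):
    group_a = rng.normal(loc=100.0, scale=15.0, size=n)
    group_b = rng.normal(loc=100.5, scale=15.0, size=n)  # tiny true difference
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"n = {n:>9,}   p-value = {p_value:.4f}")

# Typical pattern: with n = 100 the difference is not significant, but by
# n = 1,000,000 the p-value is essentially zero, even though a half-point
# difference against a 15-point spread is practically trivial.
```

In other words, once the sample is large enough, a significant p-value by itself tells us almost nothing about whether a difference actually matters.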
Keeping these points in mind, there are also a number of things that have not changed throughout the years:
- Garbage in, garbage out: The usefulness of data depends heavily on how carefully and well they were collected. After data are collected, their quality depends on how much attention was paid to suitable pre-processing: data cleaning and data screening.
- Bigger equals weirder: If you are looking for anomalies, rare events that break the rules, then larger is better. Low-frequency events often do not appear until data collection has gone on for a long time and/or encompasses a large enough group of instances to contain one of the bizarre cases (see the short calculation after this list).
- Linking adds potential: Standalone data sets are inherently limited by whatever variables are available. But if those data can be linked to some other data, new vistas may suddenly open up. There are no guarantees, but the more you can connect records here to other records over there, the more potential findings you have.
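A quick back-of-the-envelope calculation shows why sheer size matters when hunting for rare cases. This is a sketch in plain Python; the 1-in-100,000 event rate is an arbitrary assumption chosen only for illustration.

```python
# Probability of observing at least one rare event in n independent records,
# assuming (hypothetically) each record has a 1-in-100,000 chance of being
# one of the "bizarre cases".
rare_rate = 1e-5

for n in (1_000, 100_000, 10_000_000):
    p_at_least_one = 1 - (1 - rare_rate) ** n
    print(f"{n:>12,} records -> P(at least one rare case) = {p_at_least_one:.3f}")

# Roughly: 1,000 records give about a 1% chance of containing such a case,
# while 10,000,000 records make it a near certainty.
```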
Items on both of the lists above are considered pretty commonplace and uncontroversial. Taken together, however, they do shed some light on the question of how important "big data" might be. We have had lots of historical success using conventional statistics to examine modestly sized (i.e., 1,000 rows or fewer) data sets for statistical regularities. Everyone's favorite basic statistic, Student's t-test, is essentially a test for differences in the central tendency of two groups. If the data contain regularities such that one group is notably different from another group, a t-test shows it to be so.
Big data does not help us with these kinds of tests. We don't even need a thousand records for many conventional statistical comparisons, and having a million or a hundred million records won't make our job any easier (it will just take more computer memory and storage). Think about what you read in the previous chapter: we were able to start using a basic form of statistical inference with a data set that contained a population of only 51 elements. In fact, many of the most commonly used statistical techniques, like Student's t-test, were designed specifically to work with very small samples.
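For a concrete sense of scale, here is a minimal sketch in Python, again with simulated values rather than data from the chapter, showing a two-sample t-test flagging a sizable group difference with only 20 observations per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Two small groups (20 observations each) with a genuinely large difference
# in central tendency: about one standard deviation apart.
group_a = rng.normal(loc=100.0, scale=15.0, size=20)
group_b = rng.normal(loc=115.0, scale=15.0, size=20)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# With an effect this large, 20 cases per group is usually enough for the
# test to detect the difference; millions of rows are not required.
```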
On the other hand, if we are looking for needles in haystacks, it makes sense to look (as efficiently as possible) through the biggest possible haystack we can find, because it is much more likely that a big haystack will contain at least one needle and maybe more. Keeping in mind the advances in machine learning that have occurred over recent years, we begin to have an idea that good tools together with big data and interesting questions about unusual patterns could indeed provide some powerful new insights.
Let's couple this optimism with three very important cautions. The first caution is that the more complex our data are, the more difficult it will be to ensure that the data are "clean" and suitable for the purpose we plan for them. A dirty data set is worse in some ways than no data at all, because we may put a lot of time and effort into finding an insight and find nothing. Even more problematic, we may put in a lot of time and effort and find a result that is simply wrong! Many analysts believe that cleaning data (getting it ready for analysis, weeding out the anomalies, and organizing the data into a suitable configuration) actually takes up most of the time and effort of the analysis process.
The second caution is that rare and unusual events or patterns are almost always by their nature highly unpredictable. Even with the best data we can imagine and plenty of variables, we will almost always have a lot of trouble accurately enumerating all of the causes of an event. The data mining tools may show us a pattern, and we may even be able to replicate the pattern in some new data, but we may never be confident that we have understood the pattern to the point where we believe we can isolate, control, or understand the causes. Predicting the path of hurricanes provides a great example here: despite decades of advances in weather instrumentation, forecasting, and number crunching, meteorologists still have great difficulty predicting where a hurricane will make landfall or how hard the winds will blow when it gets there. The complexity and unpredictability of the forces at work make the task exceedingly difficult.
The third caution is about linking data sets. The third point in the second list above ("Linking adds potential") suggests that linkages may provide additional value. With every linkage to a new data set, however, we also increase the complexity of the data and the likelihood of dirty data and resulting spurious patterns. In addition, although many companies seem less and less concerned about the idea, the more we link data about living people (e.g., consumers, patients, voters, etc.), the more likely we are to cause a catastrophic loss of privacy. Even if you are not a big fan of the importance of privacy on principle, it is clear that security and privacy failures have cost companies dearly in both money and reputation. Today's data innovations for valuable and acceptable purposes may be tomorrow's crimes and scams. The greater the amount of linkage between data sets, the easier it is for people with malevolent intentions to exploit it.
Putting this all together, we can take a sensible position that high quality data, in abundance, together with tools used by intelligent analysts in a secure environment, may provide worthwhile benefits in the commercial sector, in education, in government, and in other areas. The focus of our efforts as data scientists, however, should not be on achieving the largest possible data sets, but rather on getting the right data and the right amount of data for the purpose we intend. There is no special virtue in having a lot of data if those data are unsuitable to the conclusions that we want to draw. Likewise, simply getting data more quickly does not guarantee that what we get will be highly relevant to our problems. Finally, although it is said that variety is the spice of life, complexity is often a danger to reliability and trustworthiness: the more complex the linkages among our data the more likely it is that problems may crop up in making use of those data or keeping them safe.