.. Copyright (C)  Google, Runestone Interactive LLC
   This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
   International License. To view a copy of this license, visit
   http://creativecommons.org/licenses/by-sa/4.0/.

.. _measures_of_spread:

Measures of Spread
==================

Measures of center are very useful for giving you a “best guess” at a variable.
But how useful are those guesses? In this section, you will learn about
**standard deviation** and **variance**. These are the most common "measures of
spread" statistics, since they indicate how spread out a dataset is. These
statistics are also used to inform how useful other statistics (such as the
mean) are for making predictions.

Trying to guess the value of a variable that doesn’t change much is a lot
easier than trying to guess the value of a variable that can change
drastically.

To take an extreme example, suppose there are two very different cities next to
each other. In “Consistentville”, everyone has the same yearly salary of
$50,000. In “Wonkytown” exactly half the people have a yearly salary of
$100,000, and the other half are unemployed.

.. https://docs.google.com/presentation/d/17OT9PXnVrXp1YhQGSm2FKNDdQbcn3QSxrMiwqU6E0Go/edit?usp=sharing

.. image:: figures/consistentville_and_wonkytown.png
   :align: center
   :alt: A visual of Wonkytown and Consistentville names.

.. image:: figures/city_income.png
   :align: center
   :alt: A table for the salaries of people in Wonkytown and Consistentville.

Above is an example dataset for 6 people in each of the towns and their
salaries.

.. fillintheblank:: mean_salary_in_consistentville

   What is the mean salary in Consistentville? (Give your answer in dollars.)
   |blank|

   - :(\$|)50(,|)000: Correct
     :x: Incorrect

.. fillintheblank:: mean_salary_in_wonkytown

   What is the mean salary in Wonkytown? (Give your answer in dollars.)
   |blank|

   - :(\$|)50(,|)000: Correct
     :x: Incorrect

Since all residents of Consistentville make the same salary of $50,000, the
mean salary is simply $50,000. Now since exactly half the residents of
Wonkytown make $100,000 and the other half make $0, it should make some
intuitive sense that the mean salary in Wonkytown is also $50,000. So “on
average,” residents of Consistentville and Wonkytown make the same salary.

.. shortanswer:: random_resident_salary

   If you take a random resident of Consistentville, what would you guess their
   salary to be? Do you expect to be correct? What about in Wonkytown?

If you take a random resident of Consistentville and guess their salary to be
the mean of $50,000, you would be right every single time. However, if you did
the same in Wonkytown, you would be wrong every single time! Not only would
your guess be wrong, it would be either $50,000 below or $50,000 above their
true salary, both of which are way off! So in this case, while the mean was an
extremely effective “best guess” in Consistentville, it was not so useful in
Wonkytown. In a city like Wonkytown, it is pretty hard to form a “best guess”.

That’s where measures of spread come in. A measure of spread statistic doesn’t
refine a measure of center, but it can tell you how useful that measure of
center is. The most common measure of spread is called the standard deviation.

.. admonition:: Standard Deviation Definition

   **The standard deviation is a measure of how spread out data points in a
   dataset are.** A low standard deviation means the data points in the dataset
   are generally close together. A high standard deviation means the data
   points in the dataset are generally far apart.

By itself, the standard deviation can help you estimate how good your “best
guess” is. It is even more useful when comparing two different datasets, as it
can tell you which dataset is more spread out.
In the case of comparing Consistentville and Wonkytown, knowing the standard
deviation alone would tell you that in Consistentville everyone makes the same
salary, while in Wonkytown the salaries differ from the mean on average by
$50,000. If you were guessing salaries, knowing the standard deviation would
help you make a much more informed guess!

Standard Deviation in a Histogram
---------------------------------

Look at the side-by-side histogram below. It contains two variables (one in red
and one in blue) with the same mean, but one with a much higher standard
deviation than the other.

.. https://docs.google.com/spreadsheets/d/1WrXhnF-KJ3ixtPtBSKoPiQ24e9qwc4tRLNdC865W8Ck/edit?usp=sharing&resourcekey=0-Ou9WqUHlrmr3LokGeo7WuQ

.. image:: figures/standard_deviation_in_histograms.png
   :align: center
   :alt: A histogram of two variables. Variable two is concentrated in a smaller range across the horizontal axis with high values, while variable one is spread out across the horizontal axis with lower vertical axis values.

.. mchoice:: standard_deviation_in_histograms

   Which variable do you think has a higher standard deviation?

   - Red

     - Incorrect: The red values are all clustered close together around 0.

   - Blue

     + Correct. The blue values are spread out over a wide range.

.. _measures_of_spread_weather:

Example: Weather
----------------

Returning to the comparison of weather in Seattle and NYC, this example shows
you how to calculate the standard deviation of a dataset in Sheets. Again, you
will use the daily maximum temperature column.

.. admonition:: Standard Deviation in Sheets

   **The STDEVP function calculates the standard deviation of a dataset.** As
   with previous summary statistic functions, you can either input several
   values separated by commas (e.g. ``=STDEVP(value1, value2, value3)``), or
   you can input a range of cells whose standard deviation you want to know
   (e.g. ``=STDEVP(A1:A10)``).

   Note that there are several variants of the ``STDEVP`` function in Sheets.
   In this section, you can always use the ``STDEVP`` function. If you are
   interested in the difference between the different variants, `this thread
   goes into some detail on the practical differences`_, and `this thread goes
   into the mathematical theory behind the difference`_. In practice, for large
   datasets there is not much numeric difference between the different
   functions.

Finding the standard deviation of the maximum daily temperature for Seattle is
almost the same as finding the average, except you use the ``STDEVP`` function.
Doing so tells you that the standard deviation of the maximum daily temperature
in Seattle is 12.9 degrees.

.. fillintheblank:: standard_deviation_seattle_max_temp

   What is the standard deviation of the maximum daily temperature of NYC? (Use
   1 decimal place in your answer.) |blank|

   - :19.4: Correct
     :x: Incorrect

You have already seen :ref:`earlier` that the mean temperatures for Seattle and
NYC differ by only 3 degrees. The standard deviation shows you that the
variability of the maximum daily temperature is about 6.5 degrees (roughly 50%)
higher in NYC compared to Seattle.

This example should illustrate that knowing the mean sometimes isn’t enough.
Just using the mean, you may have believed that Seattle and NYC have very
similar temperatures all year round. Knowing the standard deviation alongside
the average, however, tells you that while Seattle and NYC have similar mean
temperatures, there is much higher year-round variability in NYC. If you then
add in the knowledge of the maximum and minimum temperatures of both cities,
you would have a pretty good idea of the year-round temperature seasonality of
both cities.

Extension: Variance
-------------------

This material is intended as a reference for those who are curious. It
describes, with more theory and mathematics, why variance is a crucial concept
for mathematicians and statisticians.

While standard deviation is more widely used, it is actually derived from
another measure of spread, called the variance.
More precisely, **the standard deviation is the square root of the variance**.
Many `probability distributions`_ are defined in terms of mean and variance
(not standard deviation). `You can find another detailed explanation in this
article.`_

.. admonition:: Variance Definition

   The variance is the mean of the squared deviations (or squared differences)
   of a variable's values from its mean.

That is a lot of words. A better way to understand it is to outline the
procedure for calculating the variance of a dataset, call it dataset A.

1. Calculate the mean of dataset A.
2. Find the difference between the mean of dataset A and each value in dataset
   A. These values form a new dataset, dataset B.
3. Square all the values in dataset B. These values form a new dataset, dataset
   C.
4. The mean of dataset C is the variance of dataset A.

More intuitively, dataset B shows you how far points in dataset A are from the
center of dataset A. Squaring the values in dataset B is a way to make the
differences all positive (to make sure values above and below the mean are
equally “far”). Then the mean of the squared differences in dataset C tells you
“on average” how far the points in A are from the mean, measured in squared
units.

.. admonition:: Variance in Sheets

   **The VARP function calculates the variance of a dataset.** As with previous
   summary statistic functions, you can either input several values separated
   by commas (e.g. ``=VARP(value1, value2, value3)``), or you can input a range
   of cells whose variance you want to know (e.g. ``=VARP(A1:A10)``).

   (Note: :ref:`the same caveat` as with ``STDEVP`` applies to ``VARP``.)

In Consistentville, every salary is $50,000 and the mean is $50,000. Therefore,
all values in dataset B are zero, so all values in dataset C are zero. The mean
of this all-zero dataset is zero, so the variance of salary in Consistentville
is zero. (This happens if and only if all values in the dataset are the same.)

In Wonkytown, every salary is $50,000 away from the mean (either above or
below).
Therefore, all values in dataset B are either $50,000 or -$50,000, so all
values in dataset C are the square of that, 2,500,000,000, in units of dollars
squared. The mean of dataset C, and therefore the variance of salary in
Wonkytown, is this same value.

One downside of the variance is its unit of measure. Since it involves squaring
the values of dataset B, the unit of measure of the variance is always the
squared unit of measure of the initial dataset (dataset A). For example, if
considering the salaries of Consistentville or Wonkytown, the variance would be
in squared dollars. This might not be very useful, and this is how the standard
deviation (square root of the variance) came to be widely used. The purpose of
the standard deviation is to express the variance but in the same unit as the
data.

The standard deviations of the salaries in Consistentville and Wonkytown are
measured in dollars. In Consistentville, the variance is zero, so the standard
deviation is the square root of zero, which is also zero. In Wonkytown, the
variance is 2,500,000,000 dollars squared, so the standard deviation is the
square root of this, which is (you may have guessed it) $50,000. Both of these
findings tell you how far you can expect a resident's salary to be from your
guess of the mean: $0 in Consistentville, $50,000 in Wonkytown.

Note that it is not always practical to calculate the variance and standard
deviation manually. Usually, you will have to use a tool such as Sheets.

Example: Student Heights
------------------------

Suppose you have this `dataset containing the heights of students in a class`_.

.. image:: figures/student_heights_outlier.png
   :align: center
   :alt: A Sheets screenshot of a dataset of student heights.

First, use the four-step procedure for calculating variance (above) to
calculate the variance and standard deviation of this dataset. Then, you can
confirm your answers using ``VARP`` and ``STDEVP``.

.. fillintheblank:: variance_of_students_heights

   What is the variance of the heights among these students?
   (Use 1 decimal place in your answer.) |blank|

   - :10.5: Correct
     :x: Incorrect

.. fillintheblank:: standard_deviation_of_students_heights

   What is the standard deviation of the heights among these students? (Use 1
   decimal place in your answer.) |blank|

   - :3.2: Correct
     :x: Incorrect

Further Application
--------------------

There are real-world applications that these measures of spread can be used
for. `This exercise explores the salaries of professional athletes`_ with
measures of spread as well as other statistics. Try this on your own in Sheets
if you are interested in getting more experience with any of the previously
learned statistics.

.. _dataset containing the heights of students in a class: https://docs.google.com/spreadsheets/d/1c6q8T-4U3EUHnBtXWGvspgUNRFjradkwlgchJAO_W_4/edit?usp=sharing
.. _this thread goes into some detail on the practical differences: https://www.quora.com/What-is-the-difference-between-sample-standard-deviation-and-population-standard-deviation
.. _this thread goes into the mathematical theory behind the difference: https://math.stackexchange.com/questions/15098/sample-standard-deviation-vs-population-standard-deviation
.. _probability distributions: https://en.wikipedia.org/wiki/Probability_distribution
.. _You can find another detailed explanation in this article.: https://www.mathsisfun.com/data/standard-deviation.html
.. _This exercise explores the salaries of professional athletes: https://www.ck12.org/statistics/Applications-of-Variance-and-Standard-Deviation/rwa/Variance-of-a-Data-Set/
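
For readers who want to see the variance procedure from the extension written
out, here is a worked sketch for Wonkytown. It assumes the six-person example
dataset has three residents earning $100,000 and three earning $0, which is
what "exactly half" implies for six people. The symbols :math:`N` (number of
values), :math:`x_i` (each salary), :math:`\mu` (the mean), :math:`\sigma^2`
(the variance) and :math:`\sigma` (the standard deviation) are notation
introduced here for illustration only.

.. math::

   \mu = \frac{1}{N}\sum_{i=1}^{N} x_i
       = \frac{3 \cdot 100{,}000 + 3 \cdot 0}{6}
       = 50{,}000

   \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
            = \frac{3 \cdot (50{,}000)^2 + 3 \cdot (-50{,}000)^2}{6}
            = 2{,}500{,}000{,}000

   \sigma = \sqrt{\sigma^2} = 50{,}000

The first line is step 1 of the procedure, the differences inside the
parentheses are dataset B, their squares are dataset C, and the final division
by :math:`N` is the mean of dataset C. This matches the population-style
calculation that ``VARP`` and ``STDEVP`` perform.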