.. Copyright (C) Google, Runestone Interactive LLC
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/4.0/.
.. _measures_of_center:
Measures of Center
==================
Some of the most widely used statistics are “measures of center,” which
describe where the data is centered. This type of statistic is extremely useful,
as it allows you to summarize all the data with one number/category. Statistics
that summarize key aspects of the data are called **summary statistics**.
You can think of a measure of center as a “best guess”. If you had one guess at
the value of a new observation (e.g. the height of a new student who joins the
class), what would you guess?
.. admonition:: The Three Measures of Center
The **mean** is the standard form of averaging.
- The mean is calculated by adding up all the values in the data, and
dividing by the number of data points.
- The mean is only defined for quantitative variables.
- Sometimes the mean is referred to simply as the “average” (for example in
Sheets), but it is better to remember it as the mean.
The **median** is the middle value.
- If you order all the data points from lowest to highest, the median is
the value that sits directly in the middle.
- If there are two “middle values” (which occurs when there is an even
number of data points), the median is halfway between the two values.
- The median is only defined for quantitative variables.
- The median is the value such that half the values are below it, and half
the values are above it.
The **mode** is the most commonly occurring value in the data.
- If you count the number of times each category occurs, the mode is the
category with the highest count.
- Sometimes the mode is referred to simply as the most common value.
- A dataset can have multiple modes - if multiple categories have the same
count, and this count is higher than those of the other categories.
- The mode is only defined for categorical variables.
When handling different types of data, some measures of center can tell you more
useful information than others can. Whenever you have a categorical
variable, the mode is your only choice for a measure of center. However, with
quantitative variables, either the mean or median can be used to describe the
“best guess”. You can begin building intuition about which of the three measures
of center is most appropriate through some examples.
.. mchoice:: new_student_birth_country
You have a dataset on the birth country of all of your students. A new
student joins your class, and you want to take a “best guess” at where she
comes from. What measure of center would you use?
- Mean
- Incorrect: What kind of variable is birth country?
- Median
- Incorrect: What kind of variable is birth country?
- Mode
+ Correct
.. image:: figures/seattle.png
:align: center
.. shortanswer:: cost_of_living_seattle_new_york
You want to do a study on whether it is more expensive to rent an apartment
in Seattle or in New York City. What data would you collect to answer this
question, and what summary statistics could be useful?
The mean, median, and mode functions in Sheets all have the exact same syntax as
the ``MIN`` and ``MAX`` functions, defined earlier
:ref:`here`. You can either input all relevant
values into the function separated by commas, or you can define a cell range.
The latter is far more convenient in most cases.
.. admonition:: Measures of Center in Sheets
**The AVERAGE function returns the mean of a set of values.** You can either
input several values separated by a comma (e.g.
``=AVERAGE(value1, value2, value3)``), or you can input a range of cells of
which you want to know the mean (e.g. ``=AVERAGE(A1:A10)``).
Note that mean is called AVERAGE in Sheets. It is nevertheless recommended to
use the term “mean” to describe this measure of center wherever possible
(e.g. in reports and articles), to disambiguate different measures of center.
`See here for a longer discussion.`_
**The MEDIAN function returns the median of a set of values.** You can either
input several values separated by a comma (e.g.
``=MEDIAN(value1, value2, value3)``), or you can input a range of cells of
which you want to know the median (e.g. ``=MEDIAN(A1:A10)``).
**The MODE function returns the mode of a set of values.** You can either
input several values separated by a comma (e.g.
``=MODE(value1, value2, value3)``), or you can input a range of cells of
which you want to know the mode (e.g. ``=MODE(A1:A10)``).
Example: Test Scores
--------------------
Say you are helping grade for a class and your professor has given you a list of
student scores for the last exam. How would you calculate the median and mode
in sheets?
.. image:: figures/test_scores.png
:align: center
.. TODO(raskutti): Embed
https://docs.google.com/spreadsheets/d/1WrXhnF-KJ3ixtPtBSKoPiQ24e9qwc4tRLNdC865W8Ck/edit?usp=sharing&resourcekey=0-Ou9WqUHlrmr3LokGeo7WuQ
.. fillintheblank:: mean_test_scores
Given the sheet above, write a formula for the mean of the test scores.
|blank|
- :=MEAN\(A1\:A6\): Correct
:MEAN\(A1\:A6\): Incorrect: Remember formulas must start with ``=``.
:x: Incorrect
.. fillintheblank:: median_test_scores
Given the sheet above, write a formula for the median of the test scores.
|blank|
- :=MEDIAN\(A1\:A6\): Correct
:MEDIAN\(A1\:A6\): Incorrect: Remember formulas must start with ``=``.
:x: Incorrect
Now that you have some practice with creating formulas to calculate median and
mean, you can start to build some intuition as to what the differences between
these measures of center may be. Say someone asked you for your advice about
where they wanted to move after graduation, and that the weather was a major
concern for them. You want to give them a summary statistic to accurately
summarize what the weather might be like at those respective locations. Would
the mean or median make more sense? The next example can help you understand
when you would want to use the mean versus the median.
.. _measures_of_center_weather:
Example: Weather
----------------
First, calculate and compare the mean maximum daily temperature in Seattle
and New York City (NYC). The data for the two cities’ temperatures are in two
different sheets.
.. _Temperature Spreadsheet.: https://docs.google.com/spreadsheets/d/1WrXhnF-KJ3ixtPtBSKoPiQ24e9qwc4tRLNdC865W8Ck/edit?usp=sharing&resourcekey=0-Ou9WqUHlrmr3LokGeo7WuQ
.. TODO(raskutti):
https://docs.google.com/spreadsheets/d/1WrXhnF-KJ3ixtPtBSKoPiQ24e9qwc4tRLNdC865W8Ck/edit?usp=sharing&resourcekey=0-Ou9WqUHlrmr3LokGeo7WuQ
The “actual_max_temp” is in column D, and tells you the maximum daily
temperature. Calculating the mean of that is as simple as using the ``AVERAGE``
function on that cell range as shown in the image below. From this, you can see
that the mean maximum temperature in Seattle is 64.2 degrees.
.. image:: figures/sea_max_average.png
:align: center
You can now switch to the NYC sheet and use the exact same formula.
.. fillintheblank:: nyc_mean_max_temp
What is the mean maximum temperature in NYC? (Use 1 decimal point.) |blank|
- :61.7: Correct
:x: Incorrect
This example indicates that on average, over the course of twelve months,
Seattle and NYC have fairly similar temperatures. Does this seem right to you?
In reality, for a given time of year, the temperatures of Seattle and NYC
usually differ significantly. NYC winters are considerably colder than Seattle
winters, and NYC summers tend to be warmer than Seattle summers. When averaged
over twelve months, however, these effects “cancelled out”, and, when looking
just at the mean, it may look as if Seattle and NYC have similar temperatures
all year round.
Sometimes summary statistics can over-summarize the data. You
will learn more about how to take this over-summarization into account in the
:ref:`section below on measures of spread`. In the
meantime, you can look closer into investigating the median of this data.
.. shortanswer:: nyc_and_seattle_median_temperatures
Calculate the median maximum temperatures for Seattle and NYC.
Do these statistics tell a different story? Why?
Right now, the mean and median may not seem all that different. However, there
are cases where the median is more useful than the mean. The
:ref:`next section on outliers ` will explain this difference
through an example on family income.
.. _See here for a longer discussion.: https://www.quora.com/What-is-difference-between-the-mean-and-the-average