.. Copyright (C) Google, Runestone Interactive LLC
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/4.0/.
.. _variables:
Variables
=========
.. admonition:: Variable Definition
**A variable is something that changes from one observation to another.** A
dataset typically has one or more variables.
In the example above about heights of students, there is one variable: height.
Each observation is related to a different student in the class. In general,
datasets can have more than one variable. For example, the classroom dataset
could also contain the variables age, hair color, and shoe size.
As we begin looking at specific statistics, it’s important to note what
statistics mean for different types of variables. The type of variable
determines what statistics you can calculate and what questions you can answer.
There are two types of variables.
- A **quantitative variable** or “numeric” variable can only take on number
values. Examples are:
- height
- age
- years of education
- A **categorical variable** can only take on certain non-numeric values.
Examples are:
- hair color: some categories are “black”, “brown”, “blond”, etc.
- handedness: some categories are “left-handed”, “right-handed”,
“ambidextrous”
- preferred mode of transportation: some categories are “train”, “car”,
“bicycle”, etc.
All variables can be described as quantitative or categorical. Because some
concepts in statistics apply to only one type of variable, however, it is always
important to classify the variables in your dataset as quantitative or
categorical. For example, it doesn’t make much sense to calculate the average
city in a set of cities. In contrast, calculating the average temperature makes
complete sense. Also, when plotting counts of a variable, histograms are used
for quantitative variables and bar charts are used for categorical variables.
(More on this distinction can be found in
:ref:`the section on histograms and bar charts`.)
Knowing a variable’s type can also help with identifying missing or incorrect
data. For example, suppose you have a dataset of students’ heights, and one of
the values is “California”. Height is a quantitative variable but “California”
is a categorical response, so it is likely this was entered incorrectly.
.. mchoice:: quantitative_vs_categorical_variables
Which of the following are categorical variables?
- US state
+ Correct
- Shoe size
- Incorrect
- Undergraduate college
+ Correct
- Annual income
- Incorrect
- Population of Europe
- Incorrect
.. _variables_weather:
Example: Student Data
---------------------
To illustrate the difference between quantitative and categorical variable,
consider the following example concerning data of `students in a class`_.
.. image:: figures/sheet_example.png
The dataset contains, for a 20 student class, the name, height, hair color, and
birthday of students in the class.
Column B has a numeric value, so it is a quantitative variable. Columns A, C,
and D are categorical.
The “Name” and “Hair Color” variables can only be a fixed set of non-numeric
values.
Example: Weather
----------------
In this and all following examples using this dataset, the temperature is
reported in degrees Fahrenheit.
The dataset contains, for several US cities, the average (mean), minimum, and
maximum temperatures for each day from July 1, 2014 to June 30, 2015. It should
be relatively clear, just by looking at the values of the variables, which
variables are quantitative and which are categorical.
Columns D, E, and F have numeric values, and are quantitative variables. Columns
B and C are categorical.
The “month_text” can only be one of 12 words (January through December), and the
“city” must be a US city. (Since there is a natural ordering to months, the
“month_text” variable is an example of an
:ref:`ordered categorical variable`. The “city”
variable is a standard unordered categorical variable.)
The “date” variable in Column A is a little trickier, and could be considered
either quantitative or categorical. You could encode each new day as a whole
number (for example, 2014-7-1 maps to 1, 2014-7-2 maps to 2, 2015-6-30 maps to
365), in which case “date” would be quantitative. (It would be a
:ref:`discrete quantitative variable`.)
However, you could also argue that, given that the timeframe of this dataset
is July 2014 to June 2015, each day is a new category of the possible 365
categories. (This would then be an ordered categorical variable.)
How you choose to consider this variable depends on how you want to use this
data. For example, if you want to graph the daily temperature over time, you
would need to have date as a quantitative variable (so it can be used as the
x-axis). In contrast, if you want to subset the data and look at temperatures
only for all of the Mondays, it makes sense to think of date as a categorical
variable (so “Monday” is its own overarching category in which each day either
falls or does not fall).
.. _discrete_and_continuous_variables:
Extension: Discrete and Continuous Variables
--------------------------------------------
Under the umbrella of quantitative variables, there are two important distinct
types.
- A **discrete variable** is a quantitative variable that can only take
certain values. The most common examples are variables that can only be a
whole number (e.g. number of stairs in a building, number of children).
Another example would be shoe size, which can be whole numbers or half
numbers.
- A **continuous variable** is a quantitative variable that can take any value
within a range. Examples of this are numeric variables that can be expressed
to as many decimal places as necessary.
In general, it is always a good idea to know what the possible values that a
variable can take. This includes whether the variable is discrete or continuous,
as well as what the range of possible values is. (This `range of values`_ is
called the **support**.) This can help with finding missing or wrong data. For
example, if you have a dataset on height and one of the values is zero, you
might assume that datapoint is missing, since you know height must be positive.
(Moreover if one of the values is negative, you can assume that datapoint was
incorrectly recorded.)
To illustrate the difference between a discrete and continuous variable,
consider the example of height. In general, a person’s height can be expressed
to as many decimal places as necessary, for example 172.9532145 centimeters. So
it is a continuous variable. However, height is *usually* rounded to the nearest
feet and inches (5ft 8in) or to the nearest centimeter (173cm). In these cases,
it is a discrete variable, as it can only take certain values. In contrast, shoe
size is always a discrete variable. (A shoe size of 7.234 does not exist.)
.. _ordered_categorical_variables:
Extension: Ordered Categorical Variables
----------------------------------------
Categorical variables are usually unordered. This means that there is no typical
ranking to the categories. For the variable “eye color”, there is no obvious
ordering to the values. You couldn’t say that in general, brown eyes is more or
less than blue eyes.
However, some categorical variables have a natural ordering to them. For
example, consider the variable “highest level of education” where the values
are:
1. No high school diploma
2. High school diploma
3. Undergraduate degree
4. Masters degree
5. Doctoral or equivalent professional degree
While this is clearly a categorical variable since the values are non-numeric,
there is a typical ordering of the values (e.g. getting a Masters degree
requires more schooling beyond an undergraduate degree). This type of variable
is called an **ordinal variable** (or **ordered categorical variable**).
In more advanced statistics, there are models that work with `ordinal
variables`_. However, this is well beyond the scope of this course.
.. shortanswer:: categorical_variable_example
Think of a different example of an ordered categorical variable.
.. _students in a class: https://docs.google.com/spreadsheets/d/1c6q8T-4U3EUHnBtXWGvspgUNRFjradkwlgchJAO_W_4/edit?usp=sharing
.. _range of values: https://en.wikipedia.org/wiki/Support_(mathematics)
.. _define the probability distribution random variable: https://en.wikipedia.org/wiki/Random_variable#Examples
.. _ordinal variables: https://en.wikipedia.org/wiki/Ordinal_data