Distributions and Variability

Section 1.1 Distributions and Variability

These first three investigations give you a very brief introduction to some big ideas for the course. Some of you will have seen some of these ideas before and can use the investigations to refresh your memory. Some of the ideas may be new and you will see them again in later chapters. For now, try to focus on the bigger picture of analyzing data, evaluating models, and drawing appropriate conclusions.

Exercises 1.1.1 Investigation A: Hurricanes and Climate Change

One of the concerns with climate change is an increased number of tropical storms (including hurricanes and major hurricanes). In particular, in the "Atlantic Basin," scientists have tracked the number of "named storms" since 1851. According to the National Hurricane Center:

Tropical Storm: A tropical cyclone with maximum sustained winds of 39 to 73 mph (34 to 63 knots).

Hurricane: A tropical cyclone with maximum sustained winds of 74 mph (64 knots) or higher. In the western North Pacific, hurricanes are called typhoons; similar storms in the Indian Ocean and South Pacific Ocean are called cyclones.

Major Hurricane: A tropical cyclone with max sustained winds of 111 mph (96 knots) or higher.

Snippet from NOAA data table showing storm classifications

In 2020, scientists were alarmed because there were 14 recorded hurricanes, compared to 6 in 2019.

Graph showing hurricane counts for 2019 and 2020

1. Calculate Percentage Change.

Calculate the percentage change in the number of hurricanes between these two years.

Hint.

The formula for percentage change is: \(\frac{\text{new value} - \text{old value}}{\text{old value}} \times 100\%\)

Percentage change: %

Solution.

\(\frac{14-6}{6} \times 100\% = 133.3\%\) increase
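
The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `percent_change` is made up here, and the inputs are the two hurricane counts quoted above.

```python
def percent_change(old_value, new_value):
    """Return the percentage change from old_value to new_value."""
    if old_value == 0:
        raise ValueError("percentage change is undefined when the old value is 0")
    return (new_value - old_value) / old_value * 100


# Hurricane counts quoted in the investigation: 6 in 2019 and 14 in 2020.
change = percent_change(6, 14)
print(f"{change:.1f}% change")
```

A positive result indicates an increase and a negative result indicates a decrease.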

2. Evaluate the Evidence.

Does this convince you that climate change is leading to an increase in the number of hurricanes (in the Atlantic)? If so, explain why. If not, explain what additional information you would want to know.

Hint.

Consider: Is comparing just two years enough evidence? What about natural year-to-year variation?

Solution.

Just because one year saw a large increase doesn’t necessarily reflect an increasing trend. It’s also hard to know whether this is a "large" increase when we don’t have information on how much this value tends to vary from year to year. We can’t draw any causal conclusions because other things could have changed in that time frame. We also need to keep in mind that "number of hurricanes" is just one possible reflection of "climate change."

Below is a dotplot of the annual number of hurricanes from 1851 to 2024 \((n = 174)\text{.}\)

Dotplot showing the distribution of annual number of hurricanes from 1851 to 2024

Source: https://www.stormfax.com/huryear.htm

3. Interpret the Dotplot.

What does one dot in the above graph represent?

Hint.

Each dot represents data from one observational unit. What is being measured here? How many observations do we have?

Solution.

One dot represents the number of hurricanes in one year.

A year with 14 hurricanes is certainly close to record setting, but we expect there to be some variation from year to year. Below is a time plot of the number of hurricanes each year.

Time plot showing the number of hurricanes each year from 1851 to 2024

4. Compare Graph Types.

What additional information is provided by this graph and why is that helpful?

Hint.

Think about what a time plot shows that a dotplot doesn’t. How does it arrange the data differently?

Solution.

Now we know which year each value corresponds to and we might see a gradual increasing trend overall since about 1970. We also see that the change from 6 to 14 is a rather large change between years.

5. Assess Reported Mean.

The stormfax website reports the mean number of hurricanes between 1991-2020 to be 7. Does that appear consistent with the graph? Why do you think they chose that subset of years?

Hint.

Consider what’s special about the 1991-2020 period. Is it the most recent data? Why might recent data be more relevant?

Solution.

This is consistent with the time plot. If we were to put a horizontal line at a height that goes through the middle of the values, 7 appears a reasonable value for that height. Perhaps they wanted to look at more recent data for more direct comparison with current data (opinions may vary).

Oftentimes a mean or average is reported, but with no measure of spread or variability. If all the years between 1991-2020 had between 6 and 8 hurricanes, we would react very differently to 14 hurricanes in one year than if all the years between 1991-2020 had between 2 and 15 hurricanes.
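
To make the contrast concrete, here is a small Python sketch with two made-up sets of yearly counts (not the real hurricane data). Both sets have a mean of 7, but a year with 14 hurricanes falls well outside the range of one set and inside the range of the other.

```python
# Hypothetical yearly counts, invented for illustration only.
narrow = [6, 7, 8, 7, 6, 8, 7]    # every year has between 6 and 8
wide = [2, 15, 4, 10, 3, 9, 6]    # years range from 2 to 15


def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)


new_year = 14
for label, counts in [("narrow", narrow), ("wide", wide)]:
    inside = min(counts) <= new_year <= max(counts)
    print(f"{label}: mean = {mean(counts):.1f}, "
          f"range = {min(counts)} to {max(counts)}, "
          f"14 inside range? {inside}")
```

The reported mean is identical in both cases, so the mean alone cannot tell us how surprising 14 hurricanes would be.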

Terminology Detour: Standard Deviation.

The most common measure of the variability in a distribution of data is the standard deviation.

\begin{equation*} s = \sqrt{\frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n-1}} \end{equation*}

We can roughly interpret the standard deviation as the average "deviation" of the data values in the distribution from the mean of the distribution. Another interpretation: If we were to predict 7 as the number of hurricanes in a year between 1991-2020, the standard deviation would approximate the average "prediction error" for those years.
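
The formula for \(s\) can be traced step by step in Python. The five counts below are made up for illustration (they are not the 1991-2020 data), and the result is compared against `statistics.stdev` from the Python standard library, which also divides by \(n - 1\).

```python
import math
import statistics

# Hypothetical yearly counts, invented for illustration only.
counts = [4, 7, 10, 6, 8]

n = len(counts)
y_bar = sum(counts) / n                                   # mean of the values
squared_deviations = [(y - y_bar) ** 2 for y in counts]   # (y_i - y_bar)^2
s = math.sqrt(sum(squared_deviations) / (n - 1))          # divide by n - 1

print(f"mean = {y_bar:.1f}")
print(f"standard deviation by formula = {s:.3f}")
print(f"statistics.stdev              = {statistics.stdev(counts):.3f}")
```

For these values the deviations from the mean are -3, 0, 3, -1, and 1, so a standard deviation a little above 2 is in line with the "typical deviation" interpretation.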

Below is a dotplot of the data from 1991-2020 (\(n\) = 31).

Dotplot showing the distribution of annual number of hurricanes from 1991 to 2020

6. Compare Subset to Full Dataset.

Conjecture: The mean of these 31 observations is 7.2 hurricanes. Do you think 7.2 is larger or smaller or quite similar to the mean for the full dataset? Explain your reasoning.

Hint.

Look at the time plot. Does the 1991-2020 period appear different from earlier decades?

Solution.

Answers will vary, but one might think the average is a bit higher in this more recent time frame than across the entire data set.

7. Estimate Standard Deviation.

Conjecture: Provide a guess of the value of the standard deviation of these 31 values.

Hint.

Look at the 1991-2020 dotplot. What’s a typical distance from the mean of 7.2? Most values fall within what range?

Solution.

About half or a little more than half of the values appear to fall between 4 and 10, so a reasonable estimate of the standard deviation would be around 3 hurricanes.

8. Compare Standard Deviations.

Conjecture: How do you think the standard deviation from question 7 compares to the standard deviation of the full dataset? Explain your reasoning.

Hint.

Compare the spread in the 1991-2020 dotplot to the spread in the full 1851-2024 dotplot.

Solution.

The largest values (e.g., 14, 15) are in this subset and the spread in the values does appear larger in the more recent years than overall (fewer values in the 3 to 7 range compared to the full dataset).

We will often use the standard deviation as a "ruler" to help us measure distances of observations from the mean of the distribution.

9. Standardize the Value.

If we use a mean of 7.2 and a standard deviation of 3.3 hurricanes, how many standard deviations away from the mean is a value of 14 hurricanes? Above or below the mean?

Hint.

Calculate: \(\frac{14 - 7.2}{3.3}\)

standard deviations (indicate above or below):

Solution.

\(\frac{14 - 7.2}{3.3} \approx 2.06\) standard deviations above the mean

Terminology Detour: Standardizing.

The general formula for standardizing an observation’s position in the distribution is:

\begin{equation*} \frac{\text{observation value} - \text{mean of distribution}}{\text{standard deviation of distribution}} \end{equation*}

We will often consider a value to be far from the mean of a distribution if it is more than two standard deviations away from the mean.
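
As a quick check of the standardizing formula, here is a minimal Python sketch using the values from question 9 (mean 7.2, standard deviation 3.3, observation 14). The function name `standardize` and the variable names are illustrative choices, and the cutoff of two standard deviations is the rule of thumb just described, not a strict boundary.

```python
def standardize(observation, mean, standard_deviation):
    """Number of standard deviations the observation lies from the mean.

    Positive results are above the mean; negative results are below it.
    """
    return (observation - mean) / standard_deviation


z = standardize(14, mean=7.2, standard_deviation=3.3)
direction = "above" if z > 0 else "below"
print(f"{abs(z):.2f} standard deviations {direction} the mean")

# Rule of thumb from the text: more than two standard deviations is "far".
far_from_mean = abs(z) > 2
print(f"More than two standard deviations away? {far_from_mean}")
```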

In this investigation, you have just touched on one piece of information related to climate change. In fact, scientists are less concerned about the number of storms than about the intensity of the storms and how warming of the surface ocean may be leading to more destructive storms. Looking at a single year in isolation, or even a pair of years, creates a very incomplete picture of trends over time. While we expect some natural variation from year to year, the question for scientists is whether the overall trend being observed is larger than what we can reasonably attribute to natural variation.

Insight 1.1.1. Points to keep in mind.

It’s important to determine which variables are most relevant to the research question and whether you can collect the data you need to answer the question.

Simple graphs can be very informative, but you should also take care in considering the most meaningful variable representation of what you are studying even before you begin graphing.

It is imperative to consider variability and to think about possible sources of variation. Sometimes you may be able to explain and "control for" a source of variation. Often you will have to dig deeper into reasons for unusual observations and whether it is appropriate to remove them from the analysis.

The quality of your inferences will depend A LOT on the quality of the data that are collected. Not much can be learned from poorly or improperly collected data or data from a completely different time period.

Discussion.

When exploring a research question, one of the first steps is to define the variable involved (e.g., the number of hurricanes). This is an example of a quantitative variable, as opposed to a categorical variable like whether a storm has winds of 74 mph or higher. Dotplots are good choices for visualizing a quantitative variable for a small dataset. When looking at a distribution of a single quantitative variable like this, we are often interested in four key features:

Center: What would you consider a "typical" value in the distribution?

Variability: How clustered together or consistent are the observations? Or are they far apart?

Shape: Are some values more common than others? Are the values symmetric about the center?

Unusual observations: Are there any unusual observations that don’t follow the overall pattern? Are there any explanations for these values?

To summarize the center of the distribution, we often report the mean (the arithmetic average of all the numerical values in the data set) and/or the median (a middle value such that 50% of the data values are smaller and 50% are larger).

With most investigations we will also provide a follow-up practice problem or two for you to try on your own to assess your understanding of the material.

Subsection 1.1.2 Practice Problem A.A

Use the Descriptive Statistics applet (below, or follow the link to open it in a new tab) to complete this practice problem.

Press the Clear button.

In the Paste data box, type AtlanticStorms.txt.

Press the Use Data button (twice).

Use the Quantitative Variable pull-down menu to select the Number of Hurricanes variable (Hurricanes).

Shape of Distributions.

The shape of a distribution is often classified as symmetric (mirror image on each side of the center) or skewed.

The skewness statistic measures the lack of symmetry in a distribution (for example, when values above the mean extend further, on average, than values below the mean) using a \((y_i-\bar{y})^3\) term. Positive values indicate a right-skewed distribution, negative values indicate a left-skewed distribution, and values near 0 indicate a symmetric distribution.
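
The sign behavior can be seen in a short Python sketch. It uses one common definition of skewness (the average cubed deviation divided by the cube of the square root of the average squared deviation); this is an assumption, and the applet may use a slightly different formula, although the sign of the result is interpreted the same way. The data values are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (one common definition)."""
    n = len(values)
    y_bar = sum(values) / n
    m2 = sum((y - y_bar) ** 2 for y in values) / n   # average squared deviation
    m3 = sum((y - y_bar) ** 3 for y in values) / n   # average cubed deviation
    return m3 / m2 ** 1.5


# Hypothetical data sets, invented for illustration only.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]        # long tail toward high values
left_skewed = [-y for y in right_skewed]        # mirror image of the above
symmetric = [1, 2, 3, 4, 5]

for label, data in [("right skewed", right_skewed),
                    ("left skewed", left_skewed),
                    ("symmetric", symmetric)]:
    print(f"{label}: skewness = {skewness(data):.2f}")
```

Cubing keeps the sign of each deviation, so a few large deviations on one side of the mean dominate the sum and determine the sign of the statistic.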

Checkpoint 1.1.2. Describe Shape.

Based on the graph, would you consider the "number of hurricanes" distribution to be symmetric, skewed to the right, or skewed to the left? Then check the Skewness statistic box. Does the value agree with your judgment from the graph?

Hint.

Look at the tails of the distribution. Which side extends further? A positive skewness value indicates right skew, negative indicates left skew, and values near 0 indicate symmetry.

Skewed to the right
Skewed to the left
Symmetric

Solution.

The distribution appears to be skewed to the right, with a longer tail extending toward the higher values. The skewness statistic should be positive, confirming this visual assessment.

Mean and Median.

Suppose there are \(n\) numerical values and we refer to them as \(y_1, y_2, \ldots, y_n\text{.}\)

The mean, \(\bar{y}\text{,}\) is the average of all numerical values in the data set:

\begin{equation*} \bar{y} = \frac{\sum_{i=1}^n y_i}{n} \end{equation*}

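The median is a value such that 50% of the data lie below it and 50% lie above it. When the values are sorted from smallest to largest, the median is located at position \((n+1)/2\text{;}\) when \(n\) is even, this position falls halfway between the two middle values, and the median is their average.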
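
Both definitions can be sketched directly in Python. The helper names `mean` and `median` and the small data sets are illustrative (not the hurricane data); the results are checked against the `statistics` module from the Python standard library.

```python
import math
import statistics


def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Value at position (n + 1) / 2 in the sorted list (1-based).

    When n is even the position ends in .5, so the two neighboring
    values are averaged.
    """
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2
    lower = ordered[math.floor(position) - 1]   # convert to 0-based index
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2


# Hypothetical data sets, invented for illustration only.
odd_count = [7, 3, 14, 5, 6]        # n = 5, median position 3
even_count = [7, 3, 14, 5, 6, 8]    # n = 6, median position 3.5

print(mean(odd_count), median(odd_count))
print(mean(even_count), median(even_count))
print(statistics.median(odd_count), statistics.median(even_count))
```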

In the applet,

Check the box next to Guess for the Mean. Move the red line to where you think the mean of the distribution is.

Check the box next to Guess for the Median. Move the blue line to where you think the median of the distribution is.

Now check Actual for both.

Checkpoint 1.1.3. Explore the mean and median.

Which is larger, the mean or the median?

Hint.

In a right-skewed distribution, which measure of center is pulled more toward the tail?

The mean is larger
The median is larger
They are approximately equal

Checkpoint 1.1.4. Describe Variability.

In the applet, check the box for Guess for the standard deviation. Use your mouse to move one of the edges of the red rectangle to a distance that you think is representative of a "typical distance from the mean" (some values are closer, some are further). Then check the Actual box. How did you do?

Checkpoint 1.1.5. Compare Distributions.

How does the standard deviation of the full dataset compare to the 3.3 value for the years 1991-2020? Summarize what this tells us about the behavior of hurricanes.
