
Advanced High School Statistics: Third Edition

Section 5.3 Introducing hypothesis testing

In an experiment, one treatment reduces cholesterol by 10% while another treatment reduces it by 17%. Is this strong enough evidence to conclude that the second treatment is more effective? In this section, we will set up a framework for answering questions such as this and will look at the different types of decision errors that researchers can make when drawing conclusions based on data.

Subsection 5.3.1 Case study: medical consultant

People providing an organ for donation sometimes seek the help of a special medical consultant. These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant’s clients.
One consultant tried to attract patients by noting the overall complication rate for liver donor surgeries in the US is about 10%, but her clients have had only 9 complications in the 142 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).

Example 5.3.1.

We will let \(p\) represent the true complication rate for liver donors working with this consultant. Calculate the best estimate for \(p\) using the data. Label the point estimate as \(\hat{p}\text{.}\)
Solution.
The sample proportion for the complication rate is 9 complications divided by the 142 surgeries the consultant has worked on: \(\hat{p} = 9 / 142 = 0.063\text{.}\)

Example 5.3.2.

Is it possible to prove that the consultant’s work reduces complications?
Solution.
No. The claim implies that there is a causal connection, but the data are observational. For example, maybe patients who can afford a medical consultant can afford better medical care, which can also lead to a lower complication rate.

Example 5.3.3.

While it is not possible to assess the causal claim, it is still possible to ask whether the low complication rate of \(\hat{p} = 0.063\) provides evidence that the consultant’s true complication rate is different than the US complication rate. Why might we be tempted to immediately conclude that the consultant’s true complication rate is different than the US complication rate? Can we draw this conclusion?
Solution.
Her sample complication rate is \(\hat{p} = 0.063\text{,}\) which is 0.037 lower than the US complication rate of 10%. However, we cannot yet be sure if the observed difference represents a real difference or is just the result of random variation. We wouldn’t expect the sample proportion to be exactly 0.10, even if the truth was that her real complication rate was 0.10.

Subsection 5.3.2 Setting up the null and alternate hypotheses

We can set up two competing hypotheses about the consultant’s true complication rate. The first is called the null hypothesis and represents either a skeptical perspective or a perspective of no difference. The second is called the alternative hypothesis (or alternate hypothesis) and represents a new perspective such as the possibility that there has been a change or that there is a treatment effect in an experiment.

Null and alternative hypotheses.

The null hypothesis is abbreviated \(H_0\text{.}\) It represents a skeptical perspective and is often a claim of no change or no difference.
The alternative hypothesis is abbreviated \(H_A\text{.}\) It is the claim researchers hope to prove or find evidence for, and it often asserts that there has been a change or an effect.
Our job as data scientists is to play the skeptic: before we buy into the alternative hypothesis, we need to see strong supporting evidence.

Example 5.3.4.

Identify the null and alternative claim regarding the consultant’s complication rate.
Solution.
  • \(H_{0}\text{:}\) The true complication rate for the consultant’s clients is the same as the US complication rate of 10%.
  • \(H_{A}\text{:}\) The true complication rate for the consultant’s clients is different than 10%.
Often it is convenient to write the null and alternative hypotheses in mathematical or numerical terms. To do so, we must first identify the quantity of interest. This quantity of interest is known as the parameter for a hypothesis test.

Parameters and point estimates.

A parameter for a hypothesis test is the “true” value of the quantity of interest for the population. When the parameter is a proportion, we call it \(p\text{.}\)
A point estimate is calculated from a sample. When the point estimate is a proportion, we call it \(\hat{p}\text{.}\)
The observed or sample proportion of 0.063 is a point estimate for the true proportion. The parameter in this problem is the true proportion of complications for this consultant’s clients. The parameter is unknown, but the null hypothesis is that it equals the overall proportion of complications: \(p = 0.10\text{.}\) This hypothesized value is called the null value.

Null value of a hypothesis test.

The null value is the value hypothesized for the parameter in \(H_0\text{,}\) and it is sometimes represented with a subscript 0, e.g. \(p_0\) (just like \(H_0\)).
In the medical consultant case study, the parameter is \(p\) and the null value is \(p_0 = 0.10\text{.}\) We can write the null and alternative hypotheses as numerical statements as follows.
  • \(H_0\text{:}\) \(p=0.10\) (The complication rate for the consultant’s clients is equal to the US complication rate of 10%.)
  • \(H_A\text{:}\) \(p \neq 0.10\) (The complication rate for the consultant’s clients is not equal to the US complication rate of 10%.)

Hypothesis testing.

These hypotheses are part of what is called a hypothesis test. A hypothesis test is a statistical technique used to evaluate competing claims using data. Often, the null hypothesis takes a stance of no difference or no effect. If the null hypothesis and the data notably disagree, then we will reject the null hypothesis in favor of the alternative hypothesis.
Don’t worry if you aren’t a master of hypothesis testing at the end of this section. We’ll discuss these ideas and details many times in this chapter and the two chapters that follow.
The null claim is always framed as an equality: it tells us what quantity we should use for the parameter when carrying out calculations for the hypothesis test. There are three choices for the alternative hypothesis, depending upon whether the researcher is trying to prove that the value of the parameter is greater than, less than, or not equal to the null value.

Always write the null hypothesis as an equality.

We will find it most useful if we always list the null hypothesis as an equality (e.g. \(p = 0.7\)) while the alternative always uses an inequality (e.g. \(p \neq 0.7\text{,}\) \(p>0.7\text{,}\) or \(p\lt 0.7\)).

Guided Practice 5.3.5.

According to the 2010 US Census, 7.6% of residents in the state of Alaska were under 5 years old. A researcher plans to take a random sample of residents from Alaska to test whether or not this is still the case. Write out the hypotheses that the researcher should test in both plain and statistical language.
 1 
\(H_{0}: p=0.076\text{;}\) The proportion of residents under 5 years old in Alaska is unchanged from 2010. \(H_{A}: p \ne 0.076\text{;}\) The proportion of residents under 5 years old in Alaska has changed from 2010. Note that it could have increased or decreased. https://factfinder.census.gov/faces/nav/jsf/pages/community_facts.xhtml
When the alternative claim uses a \(\neq\text{,}\) we call the test a two-sided test, because either extreme provides evidence against \(H_0\text{.}\) When the alternative claim uses a \(\lt\) or a \(>\text{,}\) we call it a one-sided test.

One-sided and two-sided tests.

If the researchers are only interested in showing an increase or a decrease, but not both, use a one-sided test. If the researchers would be interested in any difference from the null value — an increase or decrease — then the test should be two-sided.

Example 5.3.6.

For the example of the consultant’s complication rate, we knew that her sample complication rate was 0.063, which was lower than the US complication rate of 0.10. Why did we conduct a two-sided hypothesis test for this setting?
Solution.
The setting was framed in the context of the consultant being helpful, but what if the consultant actually performed worse than the US complication rate? Would we care? More than ever! Since we care about a finding in either direction, we should run a two-sided test.

One-sided hypotheses are allowed only before seeing data.

After observing data, it is tempting to turn a two-sided test into a one-sided test. Avoid this temptation. Hypotheses must be set up before observing the data. If they are not, the test must be two-sided.

Subsection 5.3.3 Evaluating the hypotheses with a p-value

Example 5.3.7.

There were 142 patients in the consultant’s sample. If the null claim is true, how many would we expect to have had a complication?
Solution.
If the null claim is true, we would expect about 10% of the 142 patients, or about 14.2, to have had a complication.
The consultant’s complication rate for her 142 clients was 0.063 \((0.063 \times 142 \approx 9)\text{.}\) What is the probability that a sample would produce a number of complications this far from the expected value of 14.2, if her true complication rate were 0.10, that is, if \(H_0\) were true? The probability, which is estimated in Figure 5.3.8, is about 0.1754. We call this quantity the p-value.
Figure 5.3.8. The shaded area represents the p-value. We observed \(\hat{p} = 0.063\text{,}\) so any observations smaller than this are at least as extreme relative to the null value, \(p_0 = 0.1\text{,}\) and so the lower tail is shaded. However, since this is a two-sided test, values above 0.137 are also at least as extreme as 0.063 (relative to 0.1), and so they also contribute to the p-value. The tail areas together total about 0.1754 when calculated using the simulation technique of Subsection 5.3.4.
Figure 5.3.9. When the alternative hypothesis takes the form \(p \lt\) null value, the p-value is represented by the lower tail. When it takes the form \(p >\) null value, the p-value is represented by the upper tail. When using \(p \neq\) null value, then the p-value is represented by both tails.

Finding and interpreting the p-value.

We find and interpret the p-value according to the nature of the alternative hypothesis.
  • \(H_A\text{:}\) parameter \(>\) null value. The p-value corresponds to the area in the upper tail and is the probability of getting a test statistic larger than the observed test statistic if the null hypothesis is true and the probability model is accurate.
  • \(H_A\text{:}\) parameter \(\lt\) null value. The p-value corresponds to the area in the lower tail and is the probability of observing a test statistic smaller than the observed test statistic if the null hypothesis is true and the probability model is accurate.
  • \(H_A\text{:}\) parameter \(\ne\) null value. The p-value corresponds to the area in both tails and is the probability of observing a test statistic larger in magnitude than the observed test statistic if the null hypothesis is true and the probability model is accurate.
More generally, we can say that the p-value is the probability of getting a test statistic as extreme or more extreme than the observed test statistic in the direction of \(H_A\) if the null hypothesis is true and the probability model is accurate.
When working with proportions, we can also say that the p-value is the probability of getting a sample proportion as far from or farther from the null proportion in the direction of \(H_A\) if the null hypothesis is true and the normal model holds.
When the p-value is small, i.e. less than a previously set threshold, we say the results are statistically significant. This means the data provide such strong evidence against \(H_0\) that we reject the null hypothesis in favor of the alternative hypothesis. The threshold is called the significance level and is represented by \(\alpha\) (the Greek letter alpha). The significance level is typically set to \(\alpha = 0.05\text{,}\) but can vary depending on the field or the application.

Statistical significance.

If the p-value is less than the significance level \(\alpha\) (usually 0.05), we say that the result is statistically significant. We reject \(H_0\text{,}\) and we have strong evidence favoring \(H_A\text{.}\)
If the p-value is greater than the significance level \(\alpha\text{,}\) we say that the result is not statistically significant. We do not reject \(H_0\text{,}\) and we do not have strong evidence for \(H_A\text{.}\)
Recall that the null claim is the claim of no difference. If we reject \(H_0\text{,}\) we are asserting that there is a real difference. If we do not reject \(H_0\text{,}\) we are saying that the null claim is reasonable, but we are not saying that the null claim has been proven.

Guided Practice 5.3.10.

Because the p-value is 0.1754, which is larger than the significance level 0.05, we do not reject the null hypothesis. Explain what this means in the context of the problem using plain language.
 2 
The data do not provide evidence that the consultant’s complication rate is significantly lower or higher than the US complication rate of 10%.

Example 5.3.11.

In the previous exercise, we did not reject \(H_0\text{.}\) This means that we did not disprove the null claim. Is this equivalent to proving the null claim is true?
Solution.
No. We did not prove that the consultant’s complication rate is exactly equal to 10%. Recall that the test of hypothesis starts by assuming the null claim is true. That is, the test proceeds as an argument by contradiction. If the null claim is true, there is a 0.1754 chance of seeing sample data as divergent from 10% as we saw in our sample. Because 0.1754 is large, it is within the realm of chance error, and we cannot say the null hypothesis is unreasonable.
 3 
The p-value is a conditional probability. It is \(P(\text{getting data at least as divergent from the null value as we observed } | H_{0} \text{ is true})\text{.}\) It is NOT \(P(H_{0} \text{ is true } | \text { we got data this divergent from the null value})\text{.}\)

Double negatives can sometimes be used in statistics.

In many statistical explanations, we use double negatives. For instance, we might say that the null hypothesis is not implausible or we failed to reject the null hypothesis. Double negatives are used to communicate that while we are not rejecting a position, we are also not saying that we know it to be true.

Example 5.3.12.

Does the conclusion in Guided Practice 5.3.10 ensure that there is no real association between the surgical consultant’s work and the risk of complications? Explain.
Solution.
No. It is possible that the consultant’s work is associated with a lower or higher risk of complications. If this were the case, the sample may have been too small to reliably detect this effect.

Example 5.3.13.

An experiment was conducted where study participants were randomly divided into two groups. Both were given the opportunity to purchase a DVD, but one half was reminded that the money, if not spent on the DVD, could be used for other purchases in the future, while the other half was not. The half that was reminded that the money could be used on other purchases was 20% less likely to continue with a DVD purchase. We determined that such a large difference would only occur about 1-in-150 times if the reminder actually had no influence on student decision-making. What is the p-value in this study? Was the result statistically significant?
Solution.
The p-value was 0.006 (about 1/150). Since the p-value is less than 0.05, the data provide statistically significant evidence that US college students were actually influenced by the reminder.

What’s so special about 0.05?

We often use a threshold of 0.05 to determine whether a result is statistically significant. But why 0.05? Maybe we should use a bigger number, or maybe a smaller number. If you’re a little puzzled, that probably means you’re reading with a critical eye — good job! We’ve made a video to help clarify why 0.05:
www.openintro.org/why05
 4 
www.openintro.org/book/stat/why05/
Sometimes it’s a good idea to deviate from the standard. We’ll discuss when to choose a threshold different than 0.05 in Subsection 5.3.7.
Statistical inference is the practice of making decisions and conclusions from data in the context of uncertainty. Just as a confidence interval may occasionally fail to capture the true value of the parameter, a test of hypothesis may occasionally lead us to an incorrect conclusion. While a given data set may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often these errors occur.

Subsection 5.3.4 Calculating the p-value by simulation (special topic)

When the conditions for applying a normal model are met, we use a normal model to find the p-value of a hypothesis test. In the complication rate example, the distribution is not normal. It is, however, binomial, because we are interested in how many out of 142 patients will have complications.
We could calculate the p-value of this test using binomial probabilities. A more general approach, though, for calculating p-values when a normal model does not apply is to use what is known as simulation. While performing this procedure is outside of the scope of the course, we provide an example here in order to better understand the concept of a p-value.
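Before turning to the simulation, note that the exact binomial calculation itself is short. Here is a minimal Python sketch (assuming the scipy library is available); it computes the probability of 9 or fewer complications out of 142 when the true rate is 0.10 and doubles that tail area for the two-sided test, landing close to the simulated p-value of 0.1754 reported below.

    # Exact binomial two-sided p-value: a sketch, assuming scipy is installed.
    from scipy.stats import binom

    n = 142        # surgeries in the consultant's sample
    p0 = 0.10      # null value: the US complication rate
    observed = 9   # observed number of complications

    # Lower tail: probability of 9 or fewer complications if H0 is true.
    lower_tail = binom.cdf(observed, n, p0)

    # Two-sided test: double the tail area.
    p_value = 2 * lower_tail
    print(lower_tail, p_value)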
We simulate 142 new patients to see what result might happen if the complication rate really is 0.10. To do this, we could use a deck of cards. Take one red card, nine black cards, and mix them up. If the cards are well-shuffled, drawing the top card is one way of simulating the chance a patient has a complication if the true rate is 0.10: if the card is red, we say the patient had a complication, and if it is black then we say they did not have a complication. If we repeat this process 142 times and compute the proportion of simulated patients with complications, \(\hat{p}_{sim}\text{,}\) then this simulated proportion is exactly a draw from the null distribution.
In one such simulation, there were 12 simulated cases with a complication and 130 simulated cases without a complication: \(\hat{p}_{sim} = 12 / 142 = 0.085\text{.}\)
One simulation isn’t enough to get a sense of the null distribution, so we repeated the simulation 10,000 times using a computer. Figure 5.3.14 shows the null distribution from these 10,000 simulations. The simulated proportions that are less than or equal to \(\hat{p}=0.063\) are shaded. There were 877 simulated sample proportions with \(\hat{p}_{sim} \leq 0.063\text{,}\) which represents a fraction 0.0877 of our simulations:
\begin{gather*} \text{ left tail } = \frac{\text{Number of observed simulations with } \hat{p}_{sim}\leq\text{ 0.063}}{10000} = \frac{877}{10000} = 0.0877 \end{gather*}
However, this is not our p-value! Remember that we are conducting a two-sided test, so we should double the one-tail area to get the p-value:
 5 
This doubling approach is preferred even when the distribution isn’t symmetric, as in this case.
\begin{gather*} \text{ p-value } = 2 \times \text{ left tail } = 2 \times 0.0877 = 0.1754 \end{gather*}
Figure 5.3.14. The null distribution for \(\hat{p}\text{,}\) created from 10,000 simulated studies. The left tail contains 8.77% of the simulations. For a two-sided test, we double the tail area to get the p-value. This doubling accounts for the observations we might have observed in the upper tail, which are also at least as extreme (relative to 0.10) as what we observed, \(\hat{p} = 0.063\text{.}\)
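For readers who want to reproduce the simulation with a computer instead of a deck of cards, the following Python sketch mirrors the procedure described above; the random seed and the use of the numpy library are our own illustrative choices.

    # Simulation-based p-value for the consultant case study: a sketch using numpy.
    import numpy as np

    rng = np.random.default_rng(seed=1)  # seed fixed only for reproducibility

    n = 142           # patients per simulated study
    p0 = 0.10         # complication rate assumed under H0
    p_hat = 9 / 142   # observed sample proportion, about 0.063
    reps = 10_000     # number of simulated studies

    # Each simulated study: n patients, each with a 10% chance of a complication.
    p_hat_sim = rng.binomial(n, p0, size=reps) / n

    # Left tail: fraction of simulations at or below the observed proportion.
    left_tail = np.mean(p_hat_sim <= p_hat)

    # Two-sided p-value: double the tail area.
    p_value = 2 * left_tail
    print(left_tail, p_value)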

Subsection 5.3.5 Hypothesis testing: a five step process

Use a hypothesis test to test \(H_0\) versus \(H_A\) at a particular significance level, \(\alpha\text{.}\)

(AP exam tip) When carrying out a hypothesis test procedure, follow these five steps:.

  • Identify: Identify the hypotheses and the significance level.
  • Choose: Choose the appropriate test procedure and identify it by name.
  • Check: Check that the conditions for the test procedure are met.
  • Calculate: Calculate the test statistic and the p-value.
    \begin{gather*} \text{ test statistic } = \frac{\text{ point estimate } - \text{ null value } }{SE\ \text{ of estimate } } \end{gather*}
  • Conclude: Compare the p-value to the significance level to determine whether to reject \(H_0\) or not reject \(H_0\text{.}\) Draw a conclusion in the context of \(H_A\text{.}\)
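As a small numerical preview of the Calculate step above, the Python sketch below evaluates the test statistic formula; the numbers plugged in are purely hypothetical, and the standard error formulas for specific procedures are developed in the next two chapters.

    # Generic test statistic: how many SEs the point estimate lies from the null value.
    def test_statistic(point_estimate, null_value, se_estimate):
        return (point_estimate - null_value) / se_estimate

    # Hypothetical illustration: point estimate 0.57, null value 0.50, SE 0.04.
    z = test_statistic(0.57, 0.50, 0.04)
    print(z)  # 1.75: the estimate lies 1.75 standard errors above the null value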

Subsection 5.3.6 Decision errors

The hypothesis testing framework is a very general tool, and we often use it without a second thought. If a person makes a somewhat unbelievable claim, we are initially skeptical. However, if there is sufficient evidence that supports the claim, we set aside our skepticism. The hallmarks of hypothesis testing are also found in the US court system.

Example 5.3.15.

A US court considers two possible claims about a defendant: she is either innocent or guilty. If we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative?
Solution.
The jury considers whether the evidence is so convincing (strong) that there is evidence beyond a reasonable doubt of the person’s guilt. That is, the starting assumption (null hypothesis) is that the person is innocent until evidence is presented that convinces the jury that the person is guilty (alternative hypothesis). In statistics, our evidence comes in the form of data, and we use the significance level to decide what is beyond a reasonable doubt.
Jurors examine the evidence to see whether it convincingly shows a defendant is guilty. Notice that a jury finds a defendant either guilty or not guilty. They either reject the null claim or they do not reject the null claim. They never prove the null claim, that is, they never find the defendant innocent. If a jury finds a defendant not guilty, this does not necessarily mean the jury is confident in the person’s innocence. They are simply not convinced of the alternative that the person is guilty.
This is also the case with hypothesis testing: even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as truth. Failing to find strong evidence for the alternative hypothesis is not equivalent to providing evidence that the null hypothesis is true.
Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free. Similarly, data can point to the wrong conclusion. However, what distinguishes statistical hypothesis tests from a court system is that our framework allows us to quantify and control how often the data lead us to the incorrect conclusion.
There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a statement about which one might be true, but we might choose incorrectly. There are four possible scenarios in a hypothesis test, which are summarized in Table 5.3.16.
Table 5.3.16. Four different scenarios for hypothesis tests.
  • \(H_0\) true and we do not reject \(H_0\text{:}\) correct conclusion.
  • \(H_0\) true and we reject \(H_0\) in favor of \(H_A\text{:}\) Type I Error.
  • \(H_A\) true and we do not reject \(H_0\text{:}\) Type II Error.
  • \(H_A\) true and we reject \(H_0\) in favor of \(H_A\text{:}\) correct conclusion.

Type I and Type II Errors.

A Type I Error is rejecting \(H_0\) when \(H_0\) is actually true. When we reject the null hypothesis, it is possible that we have made a Type I Error.
A Type II Error is failing to reject \(H_0\) when \(H_A\) is actually true. When we do not reject the null hypothesis, it is possible that we have made a Type II Error.

Example 5.3.17.

In a US court, the defendant is either innocent (\(H_0\)) or guilty (\(H_A\)). What does a Type I Error represent in this context? What does a Type II Error represent? Table 5.3.16 may be useful.
Solution.
If the court makes a Type I Error, this means the defendant is innocent (\(H_0\) true) but wrongly convicted. A Type II Error means the court failed to reject \(H_0\) (i.e. failed to convict the person) when they were in fact guilty (\(H_A\) true).

Example 5.3.18.

How could we reduce the Type I Error rate in US courts? What influence would this have on the Type II Error rate?
Solution.
To lower the Type I Error rate, we might raise our standard for conviction from “beyond a reasonable doubt” to “beyond a conceivable doubt” so fewer people would be wrongly convicted. However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type II Errors.

Guided Practice 5.3.19.

How could we reduce the Type II Error rate in US courts? What influence would this have on the Type I Error rate?
 6 
To lower the Type II Error rate, we want to convict more guilty people. We could lower the standards for conviction from “beyond a reasonable doubt” to “beyond a little doubt”. Lowering the bar for guilt will also result in more wrongful convictions, raising the Type I Error rate.

Guided Practice 5.3.20.

A group of women bring a class action lawsuit that claims discrimination in promotion rates. What would a Type I Error represent in this context?
 7 
We must first identify which is the null hypothesis and which is the alternative. The alternative hypothesis is the one that bears the burden of proof, so the null hypothesis is that there was no discrimination and the alternative hypothesis is that there was discrimination. Making a Type I Error in this context would mean that in fact there was no discrimination, even though we concluded that women were discriminated against. Notice that this does not necessarily mean something was wrong with the data or that we made a computational mistake. Sometimes data simply point us to the wrong conclusion, which is why scientific studies are often repeated to check initial findings.
These examples provide an important lesson: if we reduce how often we make one type of error, we generally make more of the other type.

Subsection 5.3.7 Choosing a significance level

If \(H_0\) is true, what is the probability that we will incorrectly reject it? In hypothesis testing, we perform calculations under the premise that \(H_0\) is true, and we reject \(H_0\) if the p-value is smaller than the significance level \(\alpha\text{.}\) That is, \(\alpha\) is the probability of making a Type I Error. The choice of \(\alpha\) is not arbitrary: it depends on the gravity of the consequences of a Type I Error.

Relationship between Type I and Type II Errors.

The probability of a Type I Error is called \(\alpha\) and corresponds to the significance level of a test. The probability of a Type II Error is called \(\beta\text{.}\) As we make \(\alpha\) smaller, \(\beta\) typically gets larger, and vice versa.

Example 5.3.21.

If making a Type I Error is especially dangerous or especially costly, should we choose a smaller significance level or a higher significance level?
Solution.
Under this scenario, we want to be very cautious about rejecting the null hypothesis, so we demand very strong evidence before we are willing to reject the null hypothesis. Therefore, we want a smaller significance level, maybe \(\alpha = 0.01\text{.}\)

Example 5.3.22.

If making a Type II Error is especially dangerous or especially costly, should we choose a smaller significance level or a higher significance level?
Solution.
We should choose a higher significance level (e.g. 0.10). Here we want to be cautious about failing to reject \(H_0\) when the null is actually false.

Significance levels should reflect consequences of errors.

The significance level selected for a test should reflect the real-world consequences associated with making a Type I or Type II Error. If a Type I Error is very dangerous, make \(\alpha\) smaller.

Subsection 5.3.8 Statistical power of a hypothesis test

When the alternative hypothesis is true, the probability of not making a Type II Error is called power. It is common for researchers to perform a power analysis to ensure their study collects enough data to detect the effects they anticipate finding. As you might imagine, if the effect they care about is small or subtle, then, if the effect is real, the researchers will need a large sample size in order to have a good chance of detecting it. However, if they are interested in a large effect, they need not collect as much data.
The Type II Error rate \(\beta\) and the magnitude of the error for a point estimate are controlled by the sample size. As the sample size \(n\) goes up, the Type II Error rate goes down, and power goes up. Real differences from the null value, even large ones, may be difficult to detect with small samples. However, if we take a very large sample, we might find a statistically significant difference but the size of the difference might be so small that it is of no practical value.
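To see the connection between sample size and power in action, here is a hedged Python sketch (assuming scipy version 1.7 or later for binomtest; the "true" rate of 0.05 and the two sample sizes are hypothetical choices). It estimates, by simulation, how often a two-sided test at \(\alpha = 0.05\) of \(H_0\text{:}\) \(p = 0.10\) would correctly reject when the alternative is true.

    # Estimating power by simulation: a sketch, assuming scipy >= 1.7.
    import numpy as np
    from scipy.stats import binomtest

    rng = np.random.default_rng(seed=2)

    p0 = 0.10       # null value
    p_true = 0.05   # hypothetical true complication rate under H_A
    alpha = 0.05
    reps = 2000     # simulated studies per sample size

    for n in (142, 500):  # a smaller and a larger study
        rejections = 0
        for k in rng.binomial(n, p_true, size=reps):
            # Two-sided exact binomial test of H0: p = 0.10.
            if binomtest(int(k), n, p0).pvalue < alpha:
                rejections += 1
        print(n, rejections / reps)  # estimated power; the larger n gives higher power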

Subsection 5.3.9 Statistical significance versus practical significance

When the sample size becomes larger, point estimates become more precise and any real differences between the mean and the null value become easier to detect and recognize. Even a very small difference would likely be detected if we took a large enough sample. Sometimes researchers will take such large samples that even the slightest difference is detected. While we still say that difference is statistically significant, it might not be practically significant.
Statistically significant differences are sometimes so minor that they are not practically relevant. This is especially important to research: if we conduct a study, we want to focus on finding a meaningful result. We don’t want to spend lots of money finding results that hold no practical value.
The role of a data scientist in conducting a study often includes planning the size of the study. The data scientist might first consult experts or scientific literature to learn what would be the smallest meaningful difference from the null value. She also would obtain some reasonable estimate for the standard deviation. With these important pieces of information, she would choose a sufficiently large sample size so that the power for the meaningful difference is perhaps 80% or 90%. While larger sample sizes may still be used, she might advise against using them in some cases, especially in sensitive areas of research.

Subsection 5.3.10 Statistical significance versus a real difference

When a result is statistically significant at the \(\alpha= 0.05\) level, we have evidence that the result is real. However, when there is no difference or effect, we can expect that 5% of the time the test conclusion will lead to a Type I Error and incorrectly reject the null hypothesis. Therefore we must beware of what is called p-hacking, in which researchers may test many, many hypotheses and then publish the ones that come out statistically significant. As we noted, we can expect 5% of the results to be significant when the null hypothesis is true and there really is no difference or effect.
 8 
The problem is even greater than p-hacking. In what has been called the “reproducibility crisis”, researchers have failed to reproduce a large proportion of results that were found significant and were published in scientific journals. This problem highlights the importance of research that reproduces earlier work rather than taking the word of a single study. Also keep in mind that the probability that a difference will be found to be significant given that there is no real difference is not the same as the probability that a difference is not real, given that it was found significant. Depending upon the veracity of the hypotheses tested, the latter can be upwards of 80%, leading some to assert that “most published research is false”. https://www.economist.com/briefing/2013/10/18/trouble-at-the-lab
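A quick simulation makes the 5% figure concrete. The Python sketch below (assuming scipy is available; every setting is illustrative) runs many studies in which \(H_0\) is actually true and counts how often a test at \(\alpha = 0.05\) comes out statistically significant by chance alone.

    # Type I Errors under a true null: a sketch, assuming scipy is installed.
    import numpy as np
    from scipy.stats import binomtest

    rng = np.random.default_rng(seed=3)

    n = 142
    p0 = 0.10        # the null hypothesis is true in every simulated study
    alpha = 0.05
    num_tests = 1000

    false_positives = sum(
        binomtest(int(k), n, p0).pvalue < alpha
        for k in rng.binomial(n, p0, size=num_tests)
    )
    # Roughly alpha of the tests come out significant (often a bit fewer here,
    # since the exact binomial test is slightly conservative for discrete data).
    print(false_positives / num_tests)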

Subsection 5.3.11 When to retreat

We must point out that statistical tools rely on conditions. When the conditions are not met, these tools are unreliable and drawing conclusions from them is treacherous. The conditions for these tools typically come in two forms.
  • The individual observations must be independent. A random sample from less than 10% of the population ensures the observations are independent. In experiments, we generally require that subjects are randomized into groups. If independence fails, then advanced techniques must be used, and in some such cases, inference may not be possible.
  • Other conditions focus on sample size and skew. For example, in Chapter 5 we looked at the success-failure condition and sample size condition for when \(\hat{p}\) and \(\bar{x}\) will follow a nearly normal distribution.
Verification of conditions for statistical tools is always necessary. When conditions are not satisfied for a given statistical technique, it is necessary to investigate new methods that are appropriate for the data.
Finally, we caution that there may be no helpful inference tools when considering data that include unknown biases, such as convenience samples. For this reason, there are books, courses, and researchers devoted to the techniques of sampling and experimental design. See Section 1.1, Section 1.2, and Section 1.3 for basic principles of data collection.

Subsection 5.3.12 Section summary

  • A hypothesis test is a statistical technique used to evaluate competing claims based on data.
  • The competing claims are called hypotheses and are often about population parameters (e.g. \(\mu\) and \(p\)); they are never about sample statistics.
    • The null hypothesis is abbreviated \(H_0\text{.}\) It represents a skeptical perspective or a perspective of no difference or no change.
    • The alternative hypothesis is abbreviated \(H_A\text{.}\) It represents a new perspective or a perspective of a real difference or change. Because the alternative hypothesis is the stronger claim, it bears the burden of proof.
  • The logic of a hypothesis test: In a hypothesis test, we begin by assuming that the null hypothesis is true. Then, we calculate how unlikely it would be to get a sample value as extreme as we actually got in our sample, assuming that the null value is correct. If this likelihood is too small, it casts doubt on the null hypothesis and provides evidence for the alternative hypothesis.
  • We set a significance level, denoted \(\alpha\text{,}\) which represents the threshold below which we will reject the null hypothesis. The most common significance level is \(\alpha = 0.05\text{.}\) If we require more evidence to reject the null hypothesis, we use a smaller \(\alpha\text{.}\)
  • After verifying that the relevant conditions are met, we can calculate the test statistic. The test statistic tells us how many standard errors the point estimate (sample value) is from the null value (i.e. the value hypothesized for the parameter in the null hypothesis). When investigating a single mean or proportion or a difference of means or proportions, the test statistic is calculated as: \(\frac{\text{ point estimate } -\text{ null value } }{SE \text{ of estimate } }\text{.}\)
  • After the test statistic, we calculate the p-value. We find and interpret the p-value according to the nature of the alternative hypothesis. The three possibilities are:
    • \(H_A\text{:}\) parameter \(>\) null value. The p-value corresponds to the area in the upper tail.
    • \(H_A\text{:}\) parameter \(\lt\) null value. The p-value corresponds to the area in the lower tail.
    • \(H_A\text{:}\) parameter \(\ne\) null value. The p-value corresponds to the area in both tails.
    The p-value is the probability of getting a test statistic as extreme or more extreme than the observed test statistic in the direction of \(H_A\) if the null hypothesis is true and the probability model is accurate.
  • The conclusion or decision of a hypothesis test is based on whether the p-value is smaller or larger than the preset significance level \(\alpha\text{.}\)
    • When the p-value \(\lt \alpha\text{,}\) we say the results are statistically significant at the \(\alpha\) level and we have evidence of a real difference or change. The observed difference is beyond what would have been expected from chance variation alone. This leads us to reject \(H_0\) and gives us evidence for \(H_A\text{.}\)
    • When the p-value \(> \alpha\text{,}\) we say the results are not statistically significant at the \(\alpha\) level and we do not have evidence of a real difference or change. The observed difference was within the realm of expected chance variation. This leads us to not reject \(H_0\) and does not give us evidence for \(H_A\text{.}\)
  • Decision errors. In a hypothesis test, there are two types of decision errors that could be made. These are called Type I and Type II Errors.
    • A Type I Error is rejecting \(H_0\text{,}\) when \(H_0\) is actually true. We commit a Type I Error if we call a result significant when there is no real difference or effect. P(Type I error) = \(\alpha\text{.}\)
    • A Type II Error is not rejecting \(H_0\text{,}\) when \(H_A\) is actually true. We commit a Type II Error if we call a result not significant when there is a real difference or effect. P(Type II error) = \(\beta\text{.}\)
    • The probability of a Type I Error (\(\alpha\)) and a Type II Error (\(\beta\)) are inversely related. Decreasing \(\alpha\) makes \(\beta\) larger; increasing \(\alpha\) makes \(\beta\) smaller.
    • Once a decision is made, only one of the two types of errors is possible. If the test rejects \(H_0\text{,}\) for example, only a Type I Error is possible.
  • The power of a test.
    • When a particular \(H_A\) is true, the probability of not making a Type II Error is called power. Power \(= 1 - \beta\text{.}\)
    • The power of a test is the probability of detecting an effect of a particular size when it is present.
    • Increasing the significance level decreases the probability of a Type II Error and increases power. \(\alpha \uparrow, \beta \downarrow, \text{ power } \uparrow\text{.}\)
    • For a fixed \(\alpha\text{,}\) increasing the sample size \(n\) makes it easier to detect an effect and therefore decreases the probability of a Type II Error and increases power. \(n \uparrow, \beta \downarrow, \text{ power } \uparrow\text{.}\)
  • A small percent of the time (\(\alpha\)), a significant result will not be a real result. If many tests are run, a small percent of them will produce significant results due to chance alone.
     9 
    Similarly, if many confidence intervals are constructed, a small percent, \((100 - C)\%\text{,}\) of them will fail to capture a true value due to chance alone. A value outside the confidence interval is not an impossible value.
  • With a very large sample, a significant result may point to a result that is real but not practically significant. That is, the difference detected may be so small as to be unimportant or meaningless.
  • The inference procedures in this book all require two broad conditions to be met. The first is that some type of random sampling or random assignment must be involved. The second condition focuses on sample size and skew to determine whether the point estimate follows the intended distribution.

Exercises 5.3.13 Exercises

1. Identify hypotheses, Part I.

Write the null and alternative hypotheses in words and then symbols for each of the following situations.
  1. A tutoring company would like to understand if most students tend to improve their grades (or not) after they use their services. They sample 200 of the students who used their service in the past year and ask them if their grades have improved or declined from the previous year.
  2. Employers at a firm are worried about the effect of March Madness, a basketball championship held each spring in the US, on employee productivity. They estimate that on a regular business day employees spend on average 15 minutes of company time checking personal email, making personal phone calls, etc. They also collect data on how much company time employees spend on such non-business activities during March Madness. They want to determine if these data provide convincing evidence that employee productivity changed during March Madness.
Solution.
  1. \(H_{0} : p = 0.5\) (Neither a majority nor minority of students’ grades improved) \(H_{A} : p \ne 0.5\) (Either a majority or a minority of students’ grades improved)
  2. \(H_{0} : \mu = 15\) (The average amount of company time each employee spends not working during March Madness is 15 minutes, the same as on a regular business day.) \(H_{A} : \mu \ne 15\) (The average amount of company time each employee spends not working during March Madness is different from 15 minutes.)

2. Identify hypotheses, Part II.

Write the null and alternative hypotheses in words and using symbols for each of the following situations.
  1. Since 2008, chain restaurants in California have been required to display calorie counts of each menu item. Prior to menus displaying calorie counts, the average calorie intake of diners at a restaurant was 1100 calories. After calorie counts started to be displayed on menus, a nutritionist collected data on the number of calories consumed at this restaurant from a random sample of diners. Do these data provide convincing evidence of a difference in the average calorie intake of diners at this restaurant?
  2. The state of Wisconsin would like to understand the fraction of its adult residents that consumed alcohol in the last year, specifically if the rate is different from the national rate of 70%. To help them answer this question, they conduct a random sample of 852 residents and ask them about their alcohol consumption.

3. Online communication.

A study suggests that 60% of college students spend 10 or more hours per week communicating with others online. You believe that this is incorrect and decide to collect your own sample for a hypothesis test. You randomly sample 160 students from your dorm and find that 70% spent 10 or more hours a week communicating with others online. A friend of yours, who offers to help you with the hypothesis test, comes up with the following set of hypotheses. Indicate any errors you see.
\begin{align*} H_0\amp : \hat{p} \lt 0.6\\ H_A\amp : \hat{p} \gt 0.7 \end{align*}
Solution.
(1) The hypotheses should be about the population proportion \((p)\text{,}\) not the sample proportion. (2) The null hypothesis should have an equal sign. (3) The alternative hypothesis should have a not equals sign, and (4) it should reference the null value, \(p_{0} = 0.6\text{,}\) not the observed sample proportion. The correct way to set up these hypotheses is: \(H_{0} : p = 0.6\) and \(H_{A} : p \ne 0.6\text{.}\)

4. Married at 25.

A study suggests that 25% of 25 year olds have gotten married. You believe that this is incorrect and decide to collect your own sample for a hypothesis test. From a random sample of 25 year olds in census data with size 776, you find that 24% of them are married. A friend of yours offers to help you with setting up the hypothesis test and comes up with the following hypotheses. Indicate any errors you see.
\begin{align*} H_0\amp : \hat{p} = 0.24\\ H_A\amp : \hat{p} \neq 0.24 \end{align*}

5. Unemployment and relationship problems.

A USA Today/Gallup poll asked a group of unemployed and underemployed Americans if they have had major problems in their relationships with their spouse or another close family member as a result of not having a job (if unemployed) or not having a full-time job (if underemployed). 27% of the 1,145 unemployed respondents and 25% of the 675 underemployed respondents said they had major problems in relationships as a result of their employment status.
  1. What are the hypotheses for evaluating if the proportions of unemployed and underemployed people who had relationship problems were different?
  2. The p-value for this hypothesis test is approximately 0.35. Explain what this means in context of the hypothesis test and the data.
Solution.
  1. \(H_{0}: p_{unemp}=p_{underemp}\text{:}\) The proportions of unemployed and underemployed people who are having relationship problems are equal. \(H_{A}: p_{unemp} \neq p_{underemp}\text{:}\) The proportions of unemployed and underemployed people who are having relationship problems are different.
  2. If in fact the two population proportions are equal, the probability of observing at least a 2% difference between the sample proportions is approximately 0.35. Since this is a high probability, we fail to reject the null hypothesis. The data do not provide convincing evidence that the proportions of unemployed and underemployed people who are having relationship problems are different.

6. Which is higher?

In each part below, there is a value of interest and two scenarios (I and II). For each part, report if the value of interest is larger under scenario I, scenario II, or whether the value is equal under the scenarios.
  1. The standard error of \(\hat{p}\) when (I) \(n = 125\) or (II) \(n = 500\text{.}\)
  2. The margin of error of a confidence interval when the confidence level is (I) 90% or (II) 80%.
  3. The p-value for a Z-statistic of 2.5 calculated based on a (I) sample with \(n = 500\) or based on a (II) sample with \(n = 1000\text{.}\)
  4. The probability of making a Type 2 Error when the alternative hypothesis is true and the significance level is (I) 0.05 or (II) 0.10.

7. Testing for Fibromyalgia.

A patient named Diana was diagnosed with Fibromyalgia, a long-term syndrome of body pain, and was prescribed anti-depressants. Being the skeptic that she is, Diana didn’t initially believe that anti-depressants would help her symptoms. However after a couple months of being on the medication she decides that the anti-depressants are working, because she feels like her symptoms are in fact getting better.
  1. Write the hypotheses in words for Diana’s skeptical position when she started taking the anti-depressants.
  2. What is a Type 1 Error in this context?
  3. What is a Type 2 Error in this context?
Solution.
  1. \(H_{0}\text{:}\) Anti-depressants do not affect the symptoms of Fibromyalgia. \(H_{A}\text{:}\) Anti-depressants do affect the symptoms of Fibromyalgia (either helping or harming).
  2. Concluding that anti-depressants either help or worsen Fibromyalgia symptoms when they actually do neither.
  3. Concluding that anti-depressants do not affect Fibromyalgia symptoms when they actually do.

Subsection 5.3.14 Chapter Highlights

Statistical inference is the practice of making decisions from data in the context of uncertainty. In this chapter, we introduced two frameworks for inference: confidence intervals and hypothesis tests.
  • Confidence intervals are used for estimating unknown population parameters by providing an interval of reasonable values for the unknown parameter with a certain level of confidence.
  • Hypothesis tests are used to assess how reasonable a particular value is for an unknown population parameter by providing degrees of evidence against that value.
  • The results of confidence intervals and hypothesis tests are, generally speaking, consistent.
     10 
    In the context of proportions there will be a small range of cases where this is not true. This is because when working with proportions, the \(SE\) used for confidence intervals and the \(SE\) used for tests are slightly different, as we will see in the next chapter.
    That is:
    • Values that fall inside a 95% confidence interval (implying they are reasonable) will not be rejected by a test at the 5% significance level (implying they are reasonable), and vice-versa.
    • Values that fall outside a 95% confidence interval (implying they are not reasonable) will be rejected by a test at the 5% significance level (implying they are not reasonable), and vice-versa.
    • When the confidence level and the significance level add up to 100%, the conclusions of the two procedures are consistent.
  • Many values fall inside of a confidence interval and will not be rejected by a hypothesis test. “Not rejecting \(H_0\)” is NOT equivalent to accepting \(H_0\text{.}\) When we “do not reject \(H_0\)”, we are asserting that the null value is reasonable, not that the parameter is exactly equal to the null value.
  • For a 95% confidence interval, 95% is not the probability that the true value lies inside the confidence interval (it either does or it doesn’t). Likewise, for a hypothesis test, \(\alpha\) is not the probability that \(H_0\) is true (it either is or it isn’t). In both frameworks, the probability is about what would happen in a random sample, not about what is true of the population.
  • The confidence interval procedures and hypothesis tests described in this book should not be applied unless particular conditions (described in more detail in the following chapters) are met. If these procedures are applied when the conditions are not met, the results may be unreliable and misleading.
While a given data set may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often errors occur.