
Advanced High School Statistics: Third Edition

Section A.1 Data sets within the text

Each data set within the text is described in this appendix. For data sets that appear in multiple sections of a chapter, only the first section is listed for that chapter. Some scenarios in the text are not based on a data set and may not be listed in this appendix; e.g., the Bayes’ Theorem discussion in Chapter 3 uses imagined probabilities for whether a parking garage will fill up and whether there is a sporting event that same evening at an unnamed college. When a raw data set is available, rather than just a description, there is a corresponding page for the data set at openintro.org/data
 1 
openintro.org/data
. That webpage also includes many more data sets than are covered in this textbook, and each data set on the website includes a description, its source, a detailed overview of its variables, and download options.
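
The data set pages referenced throughout this appendix share one URL pattern, with the data set name passed as the `data` query parameter. As a minimal sketch, the page address for a named data set could be built as follows; the `https://` scheme and the helper name `data_page_url` are assumptions for illustration, not part of the text.

```python
from urllib.parse import urlencode

# Base address taken from the links in this appendix; the https scheme is assumed.
BASE_URL = "https://www.openintro.org/data/index.php"


def data_page_url(name: str) -> str:
    """Return the openintro.org page URL for the data set called `name`."""
    if not name:
        raise ValueError("data set name must be non-empty")
    # urlencode handles any characters that need escaping in the query string.
    return f"{BASE_URL}?{urlencode({'data': name})}"


if __name__ == "__main__":
    # A few data set names that appear in this appendix.
    for name in ["stent30", "loan50", "county"]:
        print(data_page_url(name))
```

The helper only constructs the address of the description page; it does not download or parse any data.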

Subsection A.1.1 Chapter 1: Data Collection

In Section 1.1: stent30
 2 
www.openintro.org/data/index.php?data=stent30
, stent365
 3 
www.openintro.org/data/index.php?data=stent365
\(\rightarrow\) The stent data is split across two data sets, one for the 0-30 day and one for the 0-365 day results. Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. www.nejm.org/doi/full/10.1056/NEJMoa1105335
 4 
www.nejm.org/doi/full/10.1056/NEJMoa1105335
. NY Times article: www.nytimes.com/2011/09/08/health/research/08stent.html
 5 
www.nytimes.com/2011/09/08/health/research/08stent.html
.
In Section 1.2: loan50
 6 
www.openintro.org/data/index.php?data=loan50
, loans_full_schema
 7 
www.openintro.org/data/index.php?data=loans_full_schema
\(\rightarrow\) This data comes from Lending Club (lendingclub.com
 8 
www.lendingclub.com/
), which provides a large set of data on the people who received loans through their platform. The data used in the textbook comes from a sample of the loans made in Q1 (Jan, Feb, March) 2018.
In Section 1.2: county
 9 
www.openintro.org/data/index.php?data=county
, county_complete
 10 
www.openintro.org/data/index.php?data=county_complete
\(\rightarrow\) These data come from several government sources. For those variables included in the county data set, only the most recent data is reported, as of what was available in late 2018. Data prior to 2011 is all from census.gov
 11 
www.census.gov/
, where the specific Quick Facts page providing the data is no longer available. The more recent data comes from USDA (ers.usda.gov)
 12 
www.ers.usda.gov/data-products/county-level-data-sets/download-data/
, Bureau of Labor Statistics (bls.gov/lau)
 13 
www.bls.gov/lau/
, SAIPE (census.gov/did/www/saipe)
 14 
www.census.gov/programs-surveys/saipe.html
, and American Community Survey (census.gov/programs-surveys/acs)
 15 
www.census.gov/programs-surveys/acs/
.
In Section 1.4: The study we had in mind regarding chocolate and heart attack patients was Janszky et al. 2009. Chocolate consumption and mortality following a first acute myocardial infarction: the Stockholm Heart Epidemiology Program
 16 
onlinelibrary.wiley.com/doi/full/10.1111/j.1365-2796.2009.02088.x/
. Journal of Internal Medicine 266:3, p248-257.
In Section 1.4: The Nurses’ Health Study was mentioned. For more information on this data set, see www.channing.harvard.edu/nhs
 17 
www.nurseshealthstudy.org/
In Section 1.5: The study we had in mind when discussing the simple randomization (no blocking) study was Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256

Subsection A.1.2 Chapter 2: Summarizing Data

In Section 2.1: county
 18 
www.openintro.org/data/index.php?data=county
\(\rightarrow\) This data set is described in the data for Chapter 1.
In Section 2.1: email50
 19 
www.openintro.org/data/index.php?data=email50
, email
 20 
www.openintro.org/data/index.php?data=email
\(\rightarrow\) These data represent emails sent to David Diez. Each data set includes 21 variables. The email50 data set is a random sample of 50 emails from email.
In Section 2.2: loan50
 21 
www.openintro.org/data/index.php?data=loan50
, county
 22 
www.openintro.org/data/index.php?data=county
\(\rightarrow\) These data sets are described in the data for Chapter 1. email50
 23 
www.openintro.org/data/index.php?data=email50
, email
 24 
www.openintro.org/data/index.php?data=email
\(\rightarrow\) These data sets are described in the data for Section 2.1.
In Section 2.2: 2019 mean and median income https://data.census.gov/table/ACSST1Y2019.S1901?hidePreview=true
 25 
data.census.gov/table/ACSST1Y2019.S1901?hidePreview=true
In Section 2.2: possum
 26 
www.openintro.org/data/index.php?data=possum
\(\rightarrow\) The brushtail possum statistics are based on a sample of possums from Australia and New Guinea. The original source of this data is as follows: Lindenmayer DB, et al. 1995. Morphological variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia). Australian Journal of Zoology 43: 449-458.
In Section 2.3: SAT and ACT score distributions \(\rightarrow\) The SAT score data comes from the 2018 distribution, which is provided at https://reports.collegeboard.org/pdf/2018-total-group-sat-suite-assessments-annual-report.pdf#page=4&zoom=auto,-63,775
 27 
reports.collegeboard.org/pdf/2018-total-group-sat-suite-assessments-annual-report.pdf#page=4&zoom=auto,-63,775
. The ACT score data is available at https://www.act.org/content/dam/act/unsecured/documents/cccr2018/P_99_999999_N_S_N00_ACT-GCPR_National.pdf#page=15
 28 
www.act.org/content/dam/act/unsecured/documents/cccr2018/P_99_999999_N_S_N00_ACT-GCPR_National.pdf#page=15
. We also acknowledge that the actual ACT score distribution is not nearly normal. However, since the topic is very accessible, we decided to keep the context and examples.
In Section 2.3: nba_players_19
 29 
www.openintro.org/data/index.php?data=nba_players_19
\(\rightarrow\) Summary information from the NBA players for the 2018-2019 season. Data were retrieved from www.nba.com/players
 30 
www.nba.com/players
.
In Section 2.4: loans_full_schema
 31 
www.openintro.org/data/index.php?data=loans_full_schema
\(\rightarrow\) This data set is described in the data for Chapter 1.
In Section 2.5: malaria
 32 
www.openintro.org/data/index.php?data=malaria
\(\rightarrow\) Lyke et al. 2017. PfSPZ vaccine induces strain-transcending T cells and durable protection against heterologous controlled human malaria infection. PNAS 114(10):2711-2716. www.pnas.org/content/114/10/2711
 33 
www.pnas.org/content/114/10/2711
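
The Section 2.1 entry above notes that email50 is a random sample of 50 emails from email. A minimal sketch of drawing such a simple random sample (without replacement) is given below. The population is a hypothetical stand-in list rather than the actual email data, and the function name, population size, and seed are assumptions made for illustration.

```python
import random


def simple_random_sample(rows, k, seed=None):
    """Draw k rows without replacement, each subset of size k equally likely."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.sample(rows, k)


if __name__ == "__main__":
    # Hypothetical stand-in for a larger data set: one dict per observation.
    population = [{"id": i} for i in range(1000)]
    sample = simple_random_sample(population, 50, seed=1)
    print(len(sample))
```

Fixing the seed makes the same 50 rows come back on every run, which is convenient when a small sample is reused across several examples.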

Subsection A.1.3 Chapter 3: Probability

In Section 3.1: email
 34 
www.openintro.org/data/index.php?data=email
\(\rightarrow\) This data set is described in the data for Chapter 2.
In Section 3.1: playing_cards
 35 
www.openintro.org/data/index.php?data=playing_cards
\(\rightarrow\) A table of the 52 cards in a standard deck.
In Section 3.2: Machine learning on fashion. \(\rightarrow\) This is a simulated data set, not based on any specific machine learning classifier.
In Section 3.2: smallpox
 36 
www.openintro.org/data/index.php?data=smallpox
\(\rightarrow\) Fenner F. 1988. Smallpox and Its Eradication (History of International Public Health, No. 6). Geneva: World Health Organization. ISBN 92-4-156110-6.
In Section 3.2: family_college
 37 
www.openintro.org/data/index.php?data=family_college
\(\rightarrow\) A simulated data set based on real population summaries at nces.ed.gov/pubs2001/2001126.pdf
 38 
nces.ed.gov/pubs2001/2001126.pdf
.
In Section 3.2: Mammogram screening, probabilities. \(\rightarrow\) The probabilities reported were obtained using studies reported at www.breastcancer.org
 39 
www.breastcancer.org/
and www.ncbi.nlm.nih.gov/pmc/articles/PMC1173421
 40 
www.ncbi.nlm.nih.gov/pmc/articles/PMC1173421/
.
In Section 3.4: stocks_18
 41 
www.openintro.org/data/index.php?data=stocks_18
\(\rightarrow\) Monthly returns for Caterpillar, Exxon Mobil Corp, and Google for November 2015 to October 2018.
In Section 3.5: Blood type prevalence. \(\rightarrow\) The fraction of people with O+ blood is about 38% according to https://www.redcrossblood.org/donate-blood/blood-types/o-blood-type.html
 43 
www.redcrossblood.org/donate-blood/blood-types/o-blood-type.html
. We used 35% for simplicity in the examples.

Subsection A.1.4 Chapter 4: Distributions of random variables

In Section 4.1: Blood type prevalence. \(\rightarrow\) This data set is described in the data for Chapter 3.
In Section 4.2: run17
 44 
www.openintro.org/data/index.php?data=run17
, run17samp
 45 
www.openintro.org/data/index.php?data=run17samp
\(\rightarrow\) These data sets represent the full population and a sample of the runners and their run times in the 2017 Cherry Blossom Run in Washington, DC. For more details, see www.cherryblossom.org
 46 
www.cherryblossom.org
In Section 4.2: poker
 47 
www.openintro.org/data/index.php?data=poker
\(\rightarrow\) The full data set includes poker winnings (and losses) for 50 days by a professional poker player, which represents their first 50 days trying to play for a living. Anonymity has been requested by the player.

Subsection A.1.5 Chapter 5: Foundations for inference

In Section 5.1: email
 48 
www.openintro.org/data/index.php?data=email
\(\rightarrow\) This data set is described in the data for Chapter 2.
In Section 5.1: pew_energy_2018
 49 
www.openintro.org/data/index.php?data=pew_energy_2018
\(\rightarrow\) The actual data has more observations than were referenced in this chapter. That is, we used a subsample since it helped smooth some of the examples to have a bit more variability. The pew_energy_2018 data set represents the full data set for each of the different energy source questions, which cover solar, wind, offshore drilling, hydraulic fracturing, and nuclear energy. The statistics used to construct the data are from the following page: www.pewinternet.org/2018/05/14/majorities-see-government-efforts-to-protect-the-environment-as-insufficient/
 50 
www.pewresearch.org/science/2018/05/14/majorities-see-government-efforts-to-protect-the-environment-as-insufficient/
In Section 5.2: pew_energy_2018
 51 
www.openintro.org/data/index.php?data=pew_energy_2018
\(\rightarrow\) This data set is described above in the data for Section 5.1.
In Section 5.2: ebola_survey
 52 
www.openintro.org/data/index.php?data=ebola_survey
\(\rightarrow\) In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll found that 82% of New Yorkers favored a “mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient”. This poll included responses of 1,042 New York adults between Oct 26th and 28th, 2014. Poll ID NY141026 on maristpoll.marist.edu
 53 
maristpoll.marist.edu/wp-content/misc/nyspolls/NY141026/Cuomo/Complete%20NBC%204%20NY_WSJ_Marist%20Poll%20New%20York%20State%20Release%20and%20Tables_October%202014.pdf
.
In Section 5.3: transplant
 54 
www.openintro.org/data/index.php?data=transplant
\(\rightarrow\) This is a made-up data set about the health outcomes for a hypothetical medical consultant. Note that the data set on the website has 62 patients, not 142 patients, so there will be a difference between what is covered in this book and the data set on the website.
In Section 5.3: Alaska residents under 5 years old. \(\rightarrow\) The 2010 statistic comes from the US census: https://data.census.gov
 55 
data.census.gov/table/DECENNIALDPCD1132010.113DP1?q=alaska%20age%202010%20census&hidePreview=false
.

Subsection A.1.6 Chapter 6: Inference for categorical data

In Section 6.1: Supreme Court \(\rightarrow\) The Gallup organization began measuring the public’s view of the Supreme Court’s job performance in 2000, and has measured it every year since then with the question: “Do you approve or disapprove of the way the Supreme Court is handling its job?”. In 2018, the Gallup poll randomly sampled 1,033 adults in the U.S. and found that 53% of them approved. https://news.gallup.com/poll/237269/supreme-court-approval-highest-2009.aspx
 56 
news.gallup.com/poll/237269/supreme-court-approval-highest-2009.aspx
In Section 6.1: Life on other planets \(\rightarrow\) A February 2018 Marist Poll reported: “Many Americans (68%) think there is intelligent life on other planets”. The results were based on a random sample of 1,033 adults in the U.S. http://maristpoll.marist.edu/212-are-americans-poised-for-an-alien-invasion
 57 
maristpoll.marist.edu/212-are-americans-poised-for-an-alien-invasion/#sthash.VrjaqJNS.Pyp2lgqf.dpbs
In Section 6.1: Congressional approval rating. \(\rightarrow\) This survey data is from https://news.gallup.com/poll/237176/snapshot-congressional-job-approval-july.aspx
 58 
news.gallup.com/poll/237176/snapshot-congressional-job-approval-july.aspx
In Section 6.1: Tire inspection. \(\rightarrow\) This is a hypothetical scenario not based on real data.
In Section 6.1: Toohey poll. \(\rightarrow\) This is a hypothetical scenario not based on a real person or real data.
In Section 6.1: Support for nuclear energy. \(\rightarrow\) The results are from the following Gallup poll: https://news.gallup.com/poll/190064/first-time-majority-oppose-nuclear-energy.aspx
 59 
news.gallup.com/poll/190064/first-time-majority-oppose-nuclear-energy.aspx
In Section 6.2: cpr
 60 
www.openintro.org/data/index.php?data=cpr
\(\rightarrow\) Böttiger et al. Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial. The Lancet, 2001.
In Section 6.2: gear_company
 61 
www.openintro.org/data/index.php?data=gear_company
\(\rightarrow\) This is a hypothetical scenario not based on real data.
In Section 6.2: healthcare_law_survey
 62 
www.openintro.org/data/index.php?data=healthcare_law_survey
\(\rightarrow\) Pew research survey on the Affordable Care Act (aka Obamacare) that ran the survey question with two variants. https://www.pewresearch.org/politics/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/
 63 
www.pewresearch.org/politics/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/
In Section 6.2: fish_oil_18
 64 
www.openintro.org/data/index.php?data=fish_oil_18
\(\rightarrow\) Manson JE, et al. 2018. Marine n-3 Fatty Acids and Prevention of Cardiovascular Disease and Cancer. NEJMoa1811403.
In Section 6.3: jury
 65 
www.openintro.org/data/index.php?data=jury
\(\rightarrow\) Simulated data set of registered voter proportions and representation on juries from a population.
In Section 6.3: M&Ms \(\rightarrow\) Rick Wicklin collected a sample of 712 candies, or about 1.5 pounds, and counted how many there were of each color. https://qz.com/918008/the-color-distribution-of-mms-as-determined-by-a-phd-in-statistics
 66 
qz.com/918008/the-color-distribution-of-mms-as-determined-by-a-phd-in-statistics/
In Section 6.4: gsearch
 67 
www.openintro.org/data/index.php?data=gsearch
\(\rightarrow\) Simulated (fake) data set for Google search experiment.
In Section 6.4: ask
 68 
www.openintro.org/data/index.php?data=ask
\(\rightarrow\) Experiment results from asking about iPods, where the original source is: Minson JA, Ruedy NE, Schweitzer ME. There is such a thing as a stupid question: Question disclosure in strategic communication. opim.wharton.upenn.edu/DPlab/papers/workingPapers/Minson working Ask%20(the%20Right%20Way)%20and%20You%20Shall%20Receive.pdf
 69 
www.acrwebsite.org/volumes/1012889/volumes/v40/NA-40
In Section 6.4: Obama and Congressional approval by political affiliation \(\rightarrow\) This survey was completed by Pew Research and the full results may be found at: https://www.pewresearch.org/politics/2012/03/14/romney-leads-gop-contest-trails-in-matchup-with-obama/
 70 
www.pewresearch.org/politics/2012/03/14/romney-leads-gop-contest-trails-in-matchup-with-obama/
In Section 6.4: Attitudes on climate change \(\rightarrow\) A Pew Research poll published in May of 2021 looks at how Americans’ attitudes about climate change differ by generation, party and other factors https://www.pewresearch.org/short-reads/2021/05/26/key-findings-how-americans-attitudes-about-climate-change-differ-by-generation-party-and-other-factors/
 71 
www.pewresearch.org/short-reads/2021/05/26/key-findings-how-americans-attitudes-about-climate-change-differ-by-generation-party-and-other-factors/

Subsection A.1.7 Chapter 7: Inference for numerical data

In Section 7.1: Risso’s dolphins \(\rightarrow\) Endo T and Haraguchi K. 2009. High mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin 60(5):743-747. Taiji was featured in the movie The Cove, and it is a significant source of dolphin and whale meat in Japan. Thousands of dolphins pass through the Taiji area annually, and we will assume these 19 dolphins represent a simple random sample from those dolphins.
In Section 7.1: Croaker white fish \(\rightarrow\) www.fda.gov/food/foodborneillnesscontaminants/metals/ucm115644.htm
 72 
www.fda.gov/food/metals/mercury-levels-commercial-fish-and-shellfish-1990-2012
In Section 7.1: run17samp
 73 
www.openintro.org/data/index.php?data=run17samp
\(\rightarrow\) This data set is described in the data for Chapter 4.
In Section 7.2: textbooks
 74 
www.openintro.org/data/index.php?data=textbooks
, ucla_textbooks_f18
 75 
www.openintro.org/data/index.php?data=ucla_textbooks_f18
\(\rightarrow\) Data were collected by OpenIntro staff in 2010 and again in 2018. For the 2018 sample, we sampled 201 UCLA courses. Of those, 68 required books that could be found on Amazon. The websites where information was retrieved: sa.ucla.edu/ro/public/soc
 76 
sa.ucla.edu/ro/public/soc
, ucla.verbacompare.com
 77 
ucla.verbacompare.com/
and amazon.com
 78 
www.amazon.com/
.
In Section 7.2: sat_improve
 79 
www.openintro.org/data/index.php?data=sat_improve
\(\rightarrow\) This is a hypothetical (fake) data set for SAT improvement from an SAT preparation company.
In Section 7.3: Jennifer-John \(\rightarrow\) Moss-Racusin CA, et al. 2012. Science faculty’s subtle gender biases favor male students. PNAS 109(41):16474-16479. https://www.pnas.org/content/109/41/16474
 80 
www.pnas.org/content/109/41/16474
In Section 7.3: resume
 81 
www.openintro.org/data/index.php?data=resume
\(\rightarrow\) Bertrand M, Mullainathan S. 2004. Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. The American Economic Review 94:4 (991-1013). www.nber.org/papers/w9873
 82 
www.nber.org/papers/w9873
In Section 7.3: Exams variants. \(\rightarrow\) This is a simulated (fake) data set for exam performance of students for two different exam variations.
In Section 7.3: ncbirths
 83 
www.openintro.org/data/index.php?data=ncbirths
\(\rightarrow\) A random sample of 1000 NC births. A sample of that random sample was used for the example in the section.
In Section 7.3: stem_cell
 84 
www.openintro.org/data/index.php?data=stem_cell
\(\rightarrow\) Menard C, et al. 2005. Transplantation of cardiac-committed mouse embryonic stem cells to infarcted sheep myocardium: a preclinical study. The Lancet 366:9490, p1005-1012. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(05)67380-1/fulltext
 85 
www.thelancet.com/journals/lancet/article/PIIS0140-6736(05)67380-1/fulltext

Subsection A.1.8 Chapter 8: Introduction to linear regression

In Section 8.1: simulated_scatter
 86 
www.openintro.org/data/index.php?data=simulated_scatter
\(\rightarrow\) Fake data used for the first three plots. The perfect linear plot uses group 4 data, where group is a variable in the data set (Figure 8.1.1). The three imperfect linear plots use groups 1-3 (Figure 8.1.2). The sinusoidal curve uses group 5 data (Figure 8.1.3). The three scatterplots with residual plots use groups 6-8 (Figure 8.1.13). The correlation plots use groups 9-19 (Figure 8.1.14 and Figure 8.1.16).
In Section 8.1: possum
 87 
www.openintro.org/data/index.php?data=possum
\(\rightarrow\) This data set is described in the data for Chapter 2.
In Section 8.1: simulated_scatter
 88 
www.openintro.org/data/index.php?data=simulated_scatter
\(\rightarrow\) The plots for things that can go wrong use groups 20-23 (Figure 8.4.1).
In Section 8.2: elmhurst
 89 
www.openintro.org/data/index.php?data=elmhurst
\(\rightarrow\) These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College that accompanied an article titled What Students Really Pay to Go to College published online by The Chronicle of Higher Education: chronicle.com/article/What-Students-Really-Pay-to-Go/131435
 90 
www.chronicle.com/article/What-Students-Really-Pay-to-Go/131435
.
In Section 8.2: textbooks
 91 
www.openintro.org/data/index.php?data=textbooks
, ucla_textbooks_f18
 92 
www.openintro.org/data/index.php?data=ucla_textbooks_f18
\(\rightarrow\) This data is described in the data for Chapter 7.
In Section 8.2: loan50
 93 
www.openintro.org/data/index.php?data=loan50
\(\rightarrow\) This data is described in the data for Chapter 1.
In Section 8.2: mariokart
 94 
www.openintro.org/data/index.php?data=mariokart
\(\rightarrow\) Auction data from Ebay (ebay.com) for the game Mario Kart for the Nintendo Wii. This data set was collected in early October, 2009.
In Section 8.2: simulated_scatter
 95 
www.openintro.org/data/index.php?data=simulated_scatter
\(\rightarrow\) The plots for types of outliers use groups 24-29 from Example 8.2.22.
In Section 8.3: county
 96 
www.openintro.org/data/index.php?data=county
, county_complete
 97 
www.openintro.org/data/index.php?data=county_complete
\(\rightarrow\) These data sets are described in the data for Chapter 1.
In Section 8.4: midterms_house
 98 
www.openintro.org/data/index.php?data=midterms_house
\(\rightarrow\) Data was retrieved from Wikipedia.