Section A.1 Data sets within the text
Each data set within the text is described in this appendix. For data sets that appear in multiple sections of a chapter, only the first section is listed for that chapter. Purely illustrative values may not be listed in this appendix; for example, Chapter 3 Bayes’ Theorem uses imagined probabilities for whether a parking garage at an unnamed college will fill up and whether there is a sporting event that same evening. When a raw data set is available, rather than just a description, there is a corresponding page for the data set at openintro.org/data. That website also includes many more data sets than are covered in this textbook, and each data set on the website includes a description, its source, a detailed overview of its variables, and download options.
Subsection A.1.1 Chapter 1: Data Collection
In Section 1.1:
stent30
, stent365
\(\rightarrow\) The stent data is split across two data sets, one for the 0-30 day and one for the 0-365 day results. Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. www.nejm.org/doi/full/10.1056/NEJMoa1105335. NY Times article: www.nytimes.com/2011/09/08/health/research/08stent.html.
In Section 1.2:
loan50
, loans_full_schema
\(\rightarrow\) This data comes from Lending Club (lendingclub.com), which provides a large set of data on the people who received loans through their platform. The data used in the textbook comes from a sample of the loans made in Q1 (Jan, Feb, March) 2018.
In Section 1.2:
county
, county_complete
\(\rightarrow\) These data come from several government sources. For those variables included in the county data set, only the most recent data is reported, as of what was available in late 2018. Data prior to 2011 is all from census.gov, where the specific Quick Facts page providing the data is no longer available. The more recent data comes from USDA (ers.usda.gov), Bureau of Labor Statistics (bls.gov/lau), SAIPE (census.gov/did/www/saipe), and American Community Survey (census.gov/programs-surveys/acs).
In Section 1.4: The study we had in mind regarding chocolate and heart attack patients was Janszky et al. 2009. Chocolate consumption and mortality following a first acute myocardial infarction: the Stockholm Heart Epidemiology Program. Journal of Internal Medicine 266:3, p248-257.
In Section 1.4: The Nurses’ Health Study was mentioned. For more information on this data set, see www.channing.harvard.edu/nhs.
In Section 1.5: The study we had in mind when discussing the simple randomization (no blocking) study was Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256.
Subsection A.1.2 Chapter 2: Summarizing Data
In Section 2.1:
email50
, email
\(\rightarrow\) These data represent emails sent to David Diez. Each data set includes 21 variables. The email50
data set is a random sample of 50 emails from email
.
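The relationship between a full data set and a random subsample of it can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `draw_sample` and the placeholder population of 1,000 record IDs are hypothetical, and nothing here is taken from how the authors actually drew their sample.

```python
import random


def draw_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a private generator keeps the draw reproducible
    return rng.sample(rows, k)


# Hypothetical stand-in for a full data set: 1,000 placeholder record IDs.
population = list(range(1000))

# Draw 50 records, mirroring the sample size described above.
sample = draw_sample(population, 50, seed=1)
print(len(sample))
```

Sampling without replacement guarantees that no record appears twice in the subsample, which is the usual meaning of a simple random sample.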
In Section 2.2:
loan50
, county
\(\rightarrow\) These data sets are described in the data for Chapter 1. email50
, email
\(\rightarrow\) These data sets are described in the data for Section 2.1.
In Section 2.2: 2019 mean and median income https://data.census.gov/table/ACSST1Y2019.S1901?hidePreview=true
In Section 2.2:
possum
\(\rightarrow\) The brushtail possum statistics are based on a sample of possums from Australia and New Guinea. The original source of this data is as follows: Lindenmayer DB, et al. 1995. Morphological variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia). Australian Journal of Zoology 43: 449-458.
In Section 2.3: SAT and ACT score distributions \(\rightarrow\) The SAT score data comes from the 2018 distribution, which is provided at https://reports.collegeboard.org/pdf/2018-total-group-sat-suite-assessments-annual-report.pdf#page=4&zoom=auto,-63,775. The ACT score data is available at https://www.act.org/content/dam/act/unsecured/documents/cccr2018/P_99_999999_N_S_N00_ACT-GCPR_National.pdf#page=15. We also acknowledge that the actual ACT score distribution is not nearly normal. However, since the topic is very accessible, we decided to keep the context and examples.
In Section 2.3:
nba_players_19
\(\rightarrow\) Summary information from the NBA players for the 2018-2019 season. Data were retrieved from www.nba.com/players.
In Section 2.4:
loans_full_schema
\(\rightarrow\) This data set is described in the data for Chapter 1.
In Section 2.5:
malaria
\(\rightarrow\) Lyke et al. 2017. PfSPZ vaccine induces strain-transcending T cells and durable protection against heterologous controlled human malaria infection. PNAS 114(10):2711-2716. www.pnas.org/content/114/10/2711
Subsection A.1.3 Chapter 3: Probability
In Section 3.2: Machine learning on fashion. \(\rightarrow\) This is a simulated data set, not based on any specific machine learning classifier.
In Section 3.2:
smallpox
\(\rightarrow\) Fenner F. 1988. Smallpox and Its Eradication (History of International Public Health, No. 6). Geneva: World Health Organization. ISBN 92-4-156110-6.
In Section 3.2:
family_college
\(\rightarrow\) A simulated data set based on real population summaries at nces.ed.gov/pubs2001/2001126.pdf.
In Section 3.2: Mammogram screening, probabilities. \(\rightarrow\) The probabilities reported were obtained using studies reported at www.breastcancer.org and www.ncbi.nlm.nih.gov/pmc/articles/PMC1173421.
In Section 3.4:
stocks_18
\(\rightarrow\) Monthly returns for Caterpillar, Exxon Mobil Corp, and Google for November 2015 to October 2018.
In Section 3.5: Blood type prevalence. \(\rightarrow\) The fraction of people with O+ blood is about 38% according to https://www.redcrossblood.org/donate-blood/blood-types/o-blood-type.html. We used 35% for simplicity in the examples.
Subsection A.1.4 Chapter 4: Distributions of random variables
In Section 4.1: Blood type prevalence. \(\rightarrow\) This data set is described in the data for Chapter 3.
In Section 4.2:
run17
, run17samp
\(\rightarrow\) These data sets represent the full population and a sample of the runners and their run times in the 2017 Cherry Blossom Run in Washington, DC. For more details, see www.cherryblossom.org.
In Section 4.2:
poker
\(\rightarrow\) The full data set includes poker winnings (and losses) for 50 days by a professional poker player, which represents their first 50 days trying to play for a living. Anonymity has been requested by the player.
Subsection A.1.5 Chapter 5: Foundations for inference
In Section 5.1:
pew_energy_2018
\(\rightarrow\) The actual data has more observations than were referenced in this chapter. That is, we used a subsample since it helped smooth some of the examples to have a bit more variability. The pew_energy_2018
data set represents the full data set for each of the different energy source questions, which cover solar, wind, offshore drilling, hydraulic fracturing, and nuclear energy. The statistics used to construct the data are from the following page: www.pewinternet.org/2018/05/14/majorities-see-government-efforts-to-protect-the-environment-as-insufficient/
In Section 5.2:
pew_energy_2018
\(\rightarrow\) See the details for this data set above, under Section 5.1.
In Section 5.2:
ebola_survey
\(\rightarrow\) In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll found that 82% of New Yorkers favored a “mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient”. This poll included responses of 1,042 New York adults between Oct 26th and 28th, 2014. Poll ID NY141026 on maristpoll.marist.edu.
In Section 5.3:
transplant
\(\rightarrow\) This is a made-up data set about the health outcomes for a hypothetical medical consultant. Note that the data set on the website has 62 patients, not 142 patients, so there will be a difference between what is covered in this book and the data set on the website.
In Section 5.3: Alaska residents under 5 years old. \(\rightarrow\) The 2010 statistic comes from the US census: https://data.census.gov.
Subsection A.1.6 Chapter 6: Inference for categorical data
In Section 6.1: Supreme Court \(\rightarrow\) The Gallup organization began measuring the public’s view of the Supreme Court’s job performance in 2000, and has measured it every year since then with the question: “Do you approve or disapprove of the way the Supreme Court is handling its job?”. In 2018, the Gallup poll randomly sampled 1,033 adults in the U.S. and found that 53% of them approved. https://news.gallup.com/poll/237269/supreme-court-approval-highest-2009.aspx
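As a rough illustration of how poll figures like these feed into the inference methods of Chapter 6, the sketch below computes a standard error and an approximate 95% margin of error from the reported sample size (1,033) and approval proportion (53%). It assumes a simple random sample and the normal approximation with multiplier 1.96; the function names are hypothetical, and this is not a reproduction of Gallup's own methodology.

```python
import math


def standard_error(p_hat, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error; z = 1.96 corresponds to roughly 95% confidence."""
    return z * standard_error(p_hat, n)


# Values reported in the poll description above.
n = 1033
p_hat = 0.53

se = standard_error(p_hat, n)
moe = margin_of_error(p_hat, n)
print(f"SE = {se:.4f}, margin of error = {moe:.4f}")
print(f"Approximate 95% interval: ({p_hat - moe:.3f}, {p_hat + moe:.3f})")
```

The margin works out to roughly three percentage points, which is typical for polls of about a thousand respondents.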
In Section 6.1: Life on other planets \(\rightarrow\) A February 2018 Marist Poll reported: “Many Americans (68%) think there is intelligent life on other planets”. The results were based on a random sample of 1,033 adults in the U.S. http://maristpoll.marist.edu/212-are-americans-poised-for-an-alien-invasion
In Section 6.1: Congressional approval rating. \(\rightarrow\) This survey data is from https://news.gallup.com/poll/237176/snapshot-congressional-job-approval-july.aspx
In Section 6.1: Tire inspection. \(\rightarrow\) This is a hypothetical scenario not based on real data.
In Section 6.1: Toohey poll. \(\rightarrow\) This is a hypothetical scenario not based on a real person or real data.
In Section 6.1: Support for nuclear energy. \(\rightarrow\) The results are from the following Gallup poll: https://news.gallup.com/poll/190064/first-time-majority-oppose-nuclear-energy.aspx
In Section 6.2:
cpr
\(\rightarrow\) Böttiger et al. Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial. The Lancet, 2001.
In Section 6.2:
gear_company
\(\rightarrow\) This is a hypothetical scenario not based on real data.
In Section 6.2:
healthcare_law_survey
\(\rightarrow\) Pew Research survey on the Affordable Care Act (aka Obamacare) that ran the survey question with two variants. https://www.pewresearch.org/politics/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/
In Section 6.2:
fish_oil_18
\(\rightarrow\) Manson JE, et al. 2018. Marine n-3 Fatty Acids and Prevention of Cardiovascular Disease and Cancer. NEJMoa1811403.
In Section 6.3:
jury
\(\rightarrow\) Simulated data set of registered voter proportions and representation on juries from a population.
In Section 6.3: M&Ms \(\rightarrow\) Rick Wicklin collected a sample of 712 candies, or about 1.5 pounds, and counted how many there were of each color. https://qz.com/918008/the-color-distribution-of-mms-as-determined-by-a-phd-in-statistics
In Section 6.4:
ask
\(\rightarrow\) Experiment results from asking about iPods, where the original source is: Minson JA, Ruedy NE, Schweitzer ME. There is such a thing as a stupid question: Question disclosure in strategic communication. opim.wharton.upenn.edu/DPlab/papers/workingPapers/Minson working Ask%20(the%20Right%20Way)%20and%20You%20Shall%20Receive.pdf
In Section 6.4: Obama and Congressional approval by political affiliation \(\rightarrow\) This survey was completed by Pew Research and the full results may be found at: https://www.pewresearch.org/politics/2012/03/14/romney-leads-gop-contest-trails-in-matchup-with-obama/
In Section 6.4: Attitudes on climate change \(\rightarrow\) A Pew Research poll published in May of 2021 looks at how Americans’ attitudes about climate change differ by generation, party and other factors https://www.pewresearch.org/short-reads/2021/05/26/key-findings-how-americans-attitudes-about-climate-change-differ-by-generation-party-and-other-factors/
Subsection A.1.7 Chapter 7: Inference for numerical data
In Section 7.1: Risso’s dolphins \(\rightarrow\) Endo T and Haraguchi K. 2009. High mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin 60(5):743-747. Taiji was featured in the movie The Cove, and it is a significant source of dolphin and whale meat in Japan. Thousands of dolphins pass through the Taiji area annually, and we will assume these 19 dolphins represent a simple random sample from those dolphins.
In Section 7.1: Croaker white fish \(\rightarrow\) www.fda.gov/food/foodborneillnesscontaminants/metals/ucm115644.htm
In Section 7.1:
run17samp
\(\rightarrow\) This data set is described in the data for Chapter 4.
In Section 7.2:
textbooks
, ucla_textbooks_f18
\(\rightarrow\) Data were collected by OpenIntro staff in 2010 and again in 2018. For the 2018 sample, we sampled 201 UCLA courses. Of those, 68 required books that could be found on Amazon. The websites where information was retrieved: sa.ucla.edu/ro/public/soc, ucla.verbacompare.com and amazon.com.
In Section 7.2:
sat_improve
\(\rightarrow\) This is a hypothetical (fake) data set for SAT improvement from an SAT preparation company.
In Section 7.3: Jennifer-John \(\rightarrow\) Moss-Racusin CA, et al. 2012. Science faculty’s subtle gender biases favor male students. PNAS 109(41):16474-16479. https://www.pnas.org/content/109/41/16474
In Section 7.3:
resume
\(\rightarrow\) Bertrand M, Mullainathan S. 2004. Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. The American Economic Review 94:4 (991-1013). www.nber.org/papers/w9873
In Section 7.3: Exams variants. \(\rightarrow\) This is a simulated (fake) data set for exam performance of students for two different exam variations.
In Section 7.3:
ncbirths
\(\rightarrow\) A random sample of 1000 NC births. A sample of that random sample was used for the example in the section.
In Section 7.3:
stem_cells
\(\rightarrow\) Menard C, et al. 2005. Transplantation of cardiac-committed mouse embryonic stem cells to infarcted sheep myocardium: a preclinical study. The Lancet: 366:9490, p1005-1012. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(05)67380-1/fulltext
Subsection A.1.8 Chapter 8: Introduction to linear regression
In Section 8.1:
simulated_scatter
\(\rightarrow\) Fake data used for the plots in this section, where the group variable in the data set identifies which plot each observation belongs to. The perfect linear plot uses group 4 data (Figure 8.1.1). The group of 3 imperfect linear plots uses groups 1-3 (Figure 8.1.2). The sinusoidal curve uses group 5 data (Figure 8.1.3). The group of 3 scatterplots with residual plots uses groups 6-8 (Figure 8.1.13). The correlation plots use groups 9-19 data (Figure 8.1.14 and Figure 8.1.16).
In Section 8.1:
simulated_scatter
\(\rightarrow\) The plots for things that can go wrong use groups 20-23 (Figure 8.4.1).
In Section 8.2:
elmhurst
\(\rightarrow\) These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College. The table accompanied an article titled What Students Really Pay to Go to College, published online by The Chronicle of Higher Education: chronicle.com/article/What-Students-Really-Pay-to-Go/131435.
In Section 8.2:
textbooks
, ucla_textbooks_f18
\(\rightarrow\) This data is described in the data for Chapter 7.
In Section 8.2:
mariokart
\(\rightarrow\) Auction data from Ebay (ebay.com) for the game Mario Kart for the Nintendo Wii. This data set was collected in early October, 2009.
In Section 8.2:
simulated_scatter
\(\rightarrow\) The plots for types of outliers use groups 24-29 from Example 8.2.22.
In Section 8.3:
county
, county_complete
\(\rightarrow\) These data sets are described in the data for Chapter 1.