.. Copyright (C) Google, Runestone Interactive LLC
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/4.0/.
Module B
========
Project Description
-------------------
In this project, you will complete a statistical analysis on a dataset, then
write a report summarizing your findings.
Here are the steps you’ll need to complete. **Under each step are sub-bullets
detailing questions you need to answer in your report.**
**Step 1: Choosing a Dataset**
In any data science project, you will need good, reliable data. There are many
data sources available online. If there is a topic you are interested in, there
is most likely a corresponding dataset. It is important that the dataset you
choose is trustworthy and that its license permits the use you have in mind.
Check the source website to see whether the dataset is licensed and what
restrictions there may be on using it. In most cases, using the data and drawing
conclusions from it is fine, but republishing the data itself may not be allowed.
**To Do:**
- Form a group of no more than 4 people. (You may want to form a group with
people who are in a similar major or have similar interests, as this may make
dataset selection easier.)
- Find a dataset to analyze online. If you are unsure where to start looking,
check out the hints section for some places to start. If you have questions
about your dataset, ask an instructor before proceeding.
- **Deliverable:** State and describe your dataset. Why did you choose this
particular dataset?
*Hints:*
- Places to find free datasets: `World Bank Open Data`_, `FiveThirtyEight`_,
`Kaggle`_.
- `Additional list of websites to find datasets`_.
- Revisit the section on importing data from Module B to review how to import
data.
**Step 2: Clean Your Dataset**
Datasets are rarely ready for analysis right away, so you will need to prepare
your dataset before you can make useful observations. Don’t worry if it seems
overwhelming at first. The goal is to remove outliers that may skew your
results while still maintaining the integrity of your data. Some steps have
been provided to help with this process. Once your dataset is clean, you should
be able to find interesting sections of the data and relevant relationships.
**To Do:**
- First, look at your data and see if you can spot any inconsistencies.
- Filter out unwanted outliers.
- Check for missing data values.
- **Deliverable:** Summarize your data cleaning process and make sure to
answer the following questions.
- Are there any ethical issues with the way you cleaned the data? What are
they?
- Did you make any trade-offs as you were cleaning? What were these trade-offs
and why did you make these decisions?
*Hints:*
- For a more in-depth explanation of data cleaning, `read this.`_
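
The cleaning checks listed above can also be sketched in code. This is a minimal
Python sketch, not part of the required spreadsheet workflow. The values are
made up, ``None`` stands in for an empty cell, and the 1.5 × IQR cutoff is one
common convention assumed here, not a rule this project prescribes.

.. code-block:: python

   import statistics

   # Hypothetical column of measurements; None marks a missing cell.
   raw = [12.0, 15.5, None, 14.2, 13.8, 250.0, 16.1, None, 15.0]

   # Separate out missing values and count them so the report can state
   # how many rows were affected.
   present = [v for v in raw if v is not None]
   missing_count = len(raw) - len(present)

   # Assumed convention: flag values more than 1.5 * IQR outside the quartiles.
   q1, _, q3 = statistics.quantiles(present, n=4)
   iqr = q3 - q1
   low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

   kept = [v for v in present if low <= v <= high]
   flagged = [v for v in present if not (low <= v <= high)]

   print(f"missing: {missing_count}, flagged as outliers: {flagged}")

Flagging outliers instead of silently deleting them makes it easier to describe
your trade-offs in the report.
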
**Step 3: Consolidate and Summarize your Data**
After cleaning up your data, you will want an overview of what your data is
saying. Sometimes when working with a large dataset, your data will be split
across different sheets, making it difficult to find summary statistics. If
that is the case, you will need to join common sections first before finding
interesting statistics about your data.
**To Do:**
- If your data is split across multiple sheets, join them on a common column.
You will need to join datasets using VLOOKUP.
- Using the joined dataset, find summary statistics for the population.
- **Deliverables:** Describe how you consolidated your data and chose your
subsections. Make sure the following questions are answered in your
discussion.
- If you needed to join your data across sheets, what did you choose as the
joining key? Why?
- For all of the numeric variables, what are the population-wide mean, median,
variance, and standard deviation?
*Hints:*
- Revisit :ref:`the section on joining data `.
- Look at :ref:`the section on measures of center for mean and median `.
- Review :ref:`the section on measures of spread for variance and standard deviation `.
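
If you want to cross-check your spreadsheet results, the join and the summary
statistics can be sketched in Python. This is an illustration under assumptions,
not the required method: the sheet contents, the ``region_names`` lookup, and
the ``R1``-style keys are all hypothetical. The dictionary lookup plays the role
of an exact-match VLOOKUP, with ``None`` standing in for a failed lookup.

.. code-block:: python

   import statistics

   # Hypothetical "lookup" sheet: join key -> region name.
   region_names = {"R1": "North", "R2": "South", "R3": "East"}

   # Hypothetical "data" sheet rows: (join key, score). "R9" has no match.
   data_rows = [("R1", 10.0), ("R2", 20.0), ("R3", 30.0), ("R1", 40.0), ("R9", 50.0)]

   # For each data row, look up the key in the other sheet.
   # A missing key yields None instead of raising an error.
   joined = [(key, region_names.get(key), score) for key, score in data_rows]
   unmatched = [row for row in joined if row[1] is None]

   # Population-wide summary statistics for the numeric column.
   scores = [score for _, _, score in joined]
   summary = {
       "mean": statistics.mean(scores),
       "median": statistics.median(scores),
       "variance": statistics.pvariance(scores),  # population variance
       "std_dev": statistics.pstdev(scores),      # population standard deviation
   }

   print(summary)
   print("rows with no match:", unmatched)

Checking for unmatched rows is a quick way to judge whether the joining key you
chose is a good one. Note that ``pvariance`` and ``pstdev`` are the population
versions; ``statistics.variance`` and ``statistics.stdev`` compute the sample
versions instead.
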
**Step 4: Choosing Subsets**
Sometimes you can find interesting trends in subsets of the data rather than the
whole dataset. For example, if you are looking at data about each of the 50
states in the United States, you can find interesting summary statistics about
the west coast states as compared to the east coast states.
**To Do:**
- Choose subsets of the data that you find interesting. Find summary statistics
for the numeric variables, within those subsets of the data.
- From these subsets, create a pivot table and a visualization that compares
summary statistics across groups.
- **Deliverables:** Continue your discussion section. Describe the subsets you
chose from the dataset and include your pivot table and visualizations. Make
sure the following questions are addressed.
- Explain why you chose this set of groups.
- Within subsets of interest, what are the count, mean, median, variance, and
standard deviation?
- Is the sample size large enough within each group? What does this imply for
the reliability of summary statistics, and for privacy considerations?
- What comparisons are particularly interesting? Why?
*Hints:*
- `Disadvantages of a small sample size.`_
- `Refresher on data privacy.`_
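
The grouped summary that a pivot table produces can be sketched in Python as
well. This is a minimal sketch with made-up rows. The group labels and the
``MIN_GROUP_SIZE`` threshold are hypothetical choices for illustration, not
values this project specifies.

.. code-block:: python

   import statistics
   from collections import defaultdict

   # Hypothetical rows: (group label, numeric value).
   rows = [
       ("west", 10.0), ("west", 14.0), ("west", 12.0),
       ("east", 20.0), ("east", 26.0),
       ("north", 7.0),
   ]

   # Collect the values for each group, as a pivot table does by row label.
   groups = defaultdict(list)
   for label, value in rows:
       groups[label].append(value)

   # Hypothetical threshold below which a group is treated as too small
   # to summarize reliably (or to report without privacy concerns).
   MIN_GROUP_SIZE = 2

   table = {}
   for label, values in groups.items():
       table[label] = {
           "count": len(values),
           "mean": statistics.mean(values),
           "median": statistics.median(values),
           "variance": statistics.pvariance(values),
           "std_dev": statistics.pstdev(values),
           "too_small": len(values) < MIN_GROUP_SIZE,
       }

   for label, stats in table.items():
       print(label, stats)

Reporting the count next to every statistic makes it easy to spot groups, such
as the single-row ``north`` group here, whose statistics should not be trusted.
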
**Step 5: Analyze your Data**
Now that you’ve looked into some subsets of data, it’s time to be more
quantitative in your analysis. Look for relationships in the data and use these
relationships to make predictions.
**To Do:**
- Determine two quantitative variables that have either a strong or an
interesting relationship.
- Identify any potential lurking variables.
- Fit a regression on the data and find the equation for the line of best fit.
- Interpret the coefficients of the linear model, in the context of the chosen
variables.
- Choose some data points to predict using your regression.
- **Deliverable:** Write the analysis section of your paper using what you have
already done above. In addition, in a short paragraph, report your predictions
in the context of the problem. Make sure the following questions are
addressed.
- How did you identify lurking variables?
- Does the line of best fit fit the data well? If not, why not? If the result
is surprising, what is surprising and why?
- Include references to “correlation” and “causation” effects.
- Is your prediction logical?
*Hints:*
- When reporting predictions, here are some examples:
- Good example: We predict that someone with a shoe size of 6.5 will be 5’4”.
- Inadequate example: Only reporting the point (6.5, 64”).
- Revisit :ref:`the section on causation vs. correlation `.
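
The regression steps above can be sketched in Python to cross-check the line of
best fit from your spreadsheet. This is a minimal least-squares sketch on
made-up data. The ``predict`` helper and its range check are illustrative
assumptions, not a prescribed method.

.. code-block:: python

   import statistics

   # Hypothetical paired observations of two quantitative variables.
   x = [1.0, 2.0, 3.0, 4.0, 5.0]
   y = [3.1, 4.9, 7.2, 8.8, 11.0]

   mean_x = statistics.mean(x)
   mean_y = statistics.mean(y)

   # Least-squares slope and intercept for the line y = intercept + slope * x.
   sxx = sum((xi - mean_x) ** 2 for xi in x)
   sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
   slope = sxy / sxx
   intercept = mean_y - slope * mean_x

   # R squared: the share of the variation in y that the line accounts for.
   ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
   ss_tot = sum((yi - mean_y) ** 2 for yi in y)
   r_squared = 1 - ss_res / ss_tot

   def predict(new_x):
       """Return (predicted y, whether new_x lies inside the observed x range)."""
       within_range = min(x) <= new_x <= max(x)
       return intercept + slope * new_x, within_range

   predicted, within_range = predict(2.5)
   print(f"y = {intercept:.2f} + {slope:.2f}x, R^2 = {r_squared:.3f}")
   print(f"prediction at x = 2.5: {predicted:.2f} (within range: {within_range})")

A prediction for an ``x`` value outside the observed range is an extrapolation,
so the helper reports that case separately. A high R squared describes how well
the line fits the data; it does not show that one variable causes the other.
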
**Step 6: Conclude and reflect**
The power of data science is that you can get meaningful takeaways from
statistics that can help you make a positive impact on society. Now that you’ve
done data analysis, take a moment to reflect on your findings and think about
the broader implications.
**To Do:**
- Include a conclusion summarizing your findings.
- Who does this affect?
- What did you learn?
- Proofread your report.
- **Deliverable:** Write the conclusion section of your paper. Submit your
report and your sheets reflecting your analysis by [Due Date].
*Hints:*
- `Examples of reports backed by data science`_.
**Optional** (faculty can decide whether to include or not): After completing
and submitting your project, complete the group work self-assessment and group
assessment.
Grading Rubric
--------------
.. list-table::
:widths: 20 20 20 20 20
:header-rows: 1
:stub-columns: 1
:align: left
* -
- **Excellent**
- **Developing**
- **Beginning**
- **NA / Not Present**
* - **Dataset (2)**
- Report includes a rationale for why the dataset was chosen. If students
selected a different dataset, the dataset must have been approved by the
instructor.
-
- Report does not include a rationale for why the dataset was chosen.
- The dataset was not approved by the instructor.
* - **Data Cleaning (8)**
- All missing/unclean data is found and accounted for in a way that makes
sense. The report references data types, any ethical tradeoffs, and
outlines what steps were taken and why.
- Some crucial steps are not taken. Steps outlined to clean the data are
ambiguous.
- There is an attempt at data cleaning, but it does not get far. Large
chunks of missing/unclean data are untreated. Key steps of cleaning
process were not reported.
- Report does not include any reference to data cleaning (independently of
whether data cleaning was done).
* - **Joining (4)**
- An appropriate join key was chosen. VLOOKUP was used successfully to
create a joined table. The report contains a brief mention of why this
key was chosen.
- An appropriate join key was chosen and the join is successfully executed
using VLOOKUP, but the report does not include any discussion of why this
key was chosen.
- There was an attempt at joining, but the wrong formula was used or the
wrong key was used.
- There was no attempt at using VLOOKUP.
* - **Population Summary Statistics (6)**
- The summary statistics are accurately calculated and reported. There is
some comment on what these values mean for the distribution.
- Almost all of the important summary statistics are correctly calculated
and reported.
- There is an attempt at calculating summary statistics, but they are
incorrect or not referenced in the report.
- There is no attempt at calculating the population summary statistics.
* - **Grouped Summary Statistics (8)**
- A pivot table was used to calculate relevant summary statistics per
group. The pivot table is presented in the report in a clean way.
There is some other visualization showing some important summary
statistics. There is some mention of sample size within groups, as well
as why the specific grouping was chosen.
- There is a working attempt at a pivot table, and it is presented in the
report. Not all numbers are accurate, and there is no extra
visualization. There is some mention of sample size within groups.
- There is an attempt at a pivot table, but it uses the wrong dimensions
and measures. The grouped summary statistics are incorrect or
non-existent.
- There is no attempt at a pivot table.
* - **Regression (8)**
- Report includes both the scatter plot and the line-of-best-fit equation,
and these values are (close to) correct. The report includes a discussion
of why the particular variables were chosen, the meaning of the
coefficients, and correlation versus causation. There is some mention of
whether regression is appropriate for the sample size.
- The line of best fit is not completely correct. The scatter plot is
missing from or wrongly formatted in the report. The discussion on
variable selection, coefficient interpretation, and correlation vs.
causation is not sufficiently detailed or accurate.
- There is some attempt at a line of best fit, but the values are
completely wrong. The scatter plot or the equation is not included.
There is no proper discussion on variable selection, coefficient
interpretation, or correlation vs. causation.
- There is no attempt at fitting a regression.
* - **Prediction (6)**
- The equation of the line of best fit is used to predict these values. The
report correctly identifies and explains which points are suitable for
prediction. The ethics of prediction are mentioned, and the report includes
the pros and cons of using a linear regression to predict.
- Values are chosen for prediction that are largely appropriate. The report
struggles with why some points are not suitable for prediction. There is
some mention of the ethics of using prediction from a linear model.
- There is an unsuccessful attempt at prediction. There is little or no
mention of suitability of prediction of certain points, or the chosen
points are not usable with this model.
- There is no attempt at prediction using the line of best fit.
* - **Conclusion (4)**
- The report contains a conclusion section summarizing key findings from
other rubric areas. It is concise and complete.
- The report contains a conclusion section, but either contains minor
inconsistencies with previous findings, or omits relevant findings.
- The report contains a conclusion section, but it is incomplete or doesn’t
accurately reflect previous findings.
- The report does not contain a conclusion section.
* - **Readability (4)**
- The report is structured by section, with appropriate headings. The
report has very few spelling/grammar errors.
-
- The report’s structure lacks clarity or is otherwise difficult to read.
The report has several spelling/grammar errors.
- There is no report.
* - **Total (50)**
-
-
-
-
**Optional** (faculty can choose whether to include or not): `Here`_ is an
example project.
.. _World Bank Open Data: https://data.worldbank.org/
.. _FiveThirtyEight: https://data.fivethirtyeight.com/
.. _Kaggle: https://www.kaggle.com/datasets
.. _Additional list of websites to find datasets: https://www.dataquest.io/blog/free-datasets-for-projects/
.. _read this.: https://elitedatascience.com/data-cleaning
.. _Disadvantages of a small sample size.: https://sciencing.com/disadvantages-small-sample-size-8448532.html
.. _Refresher on data privacy.: https://www.siliconrepublic.com/enterprise/ethics-data-science-bias
.. _Examples of reports backed by data science: https://www.un.org/en/climatechange/reports.shtml
.. _Here : https://docs.google.com/document/d/1BcaI1J1xG1_deyl-x7rZoMIepgOiVmtoxnQSiU96Ah8/edit?usp=sharing