Preface
Preface from the Second Edition
by Jan Pearce and Jacqueline Boggs
We are excited to bring you this enhanced edition of the book. As we were planning to teach a course in data analytics, one that is cross-listed in computer science and business at our institution, we found it quite challenging to identify a book with appropriate content for this type of interdisciplinary course. We were very excited to find this open source book because of its clear focus on the data. We both believe that curiosity is exactly what drives data science and data analytics. When we encounter a set of data, it leads us to ask provocative questions that can often be answered by the data techniques covered in this book.
As professors, we believe it is crucially important that students build life-long learning skills. We have found that it is sometimes difficult for students to transfer what they have learned to another area, topic, or dataset. For these reasons, we wanted to add datasets to this book so that we could help students learn to better apply and transfer their knowledge.
Some of the key changes from the First Edition include:
Learning Goals, Learning Objectives, and Glossaries added to each chapter.
Chapter titles that identify the data technique to be utilized while still letting curiosity about each of the datasets drive the exploration.
The fourth chapter has been significantly expanded to include a targeted introduction to, or review of, Python.
The option to use either Google Colaboratory notebooks or an Anaconda installation with Jupyter notebooks.
Additional datasets, presented as case studies focused on business applications, that complement the existing case studies on other interesting topics.
One can find data science offered by departments such as computer science, math, or statistics, as well as business, so this edition strives to appeal to the interests of students in each of these disciplines. Of course, the applications of data science are broader still and reach across the entire curriculum. Our best hope is that the second edition of this text can be used for courses in Data Science, Data Analytics, Business Analytics, and possibly beyond!
We hope you like it and would love to hear from you!
Preface from the First Edition
by Brad Miller
It is said that the most important characteristic of a data scientist is curiosity. Curiosity has certainly led me on a path of discovery throughout the world of data science and through the many fascinating data sets that I have encountered. So, the premise of this book is to let the data sets lead you to learning. The best and most interesting way to learn is to find some data and then begin to ask questions about it, analyze it, visualize it, and then write down the new questions that occur to you during your initial analysis.
This is how I organized the first two data science courses I ever taught, and surprisingly it worked. In fact, it worked so well that I would never want to teach it any other way. Nevertheless, it may not be clear from a high-level look at the table of contents what this course covers and which learning goals it strives to achieve. So let me lay it out for you in a different organization.
Learning Objectives
Articulate the data science processing pipeline
Extract data using SQL
Gather data from the Internet using web APIs and screen scraping
Combine data from different sources
Clean the data
Handle missing data, find outliers, and fix data
Normalize and rescale data
Visualize the data
Translate questions to analysis and analysis to interesting stories
Analyze data
Single variable regression, logistic regression
Market basket analysis
Cohort analysis
Sentiment analysis, exposure to Bayes' theorem
Time series
Geographic analysis
Simulations, Monte Carlo
Understand statistical significance and how to test for it using practical simulation techniques.
More Traditional Topic Outline
Data Gathering
Using Web APIs
Reading CSV files
Screen Scraping
Reading data from relational databases with SQL
Data Munging
dealing with missing data
string processing
regular expressions
re-encoding data (one-hot)
re-scaling data
Data Querying
filter
group by and aggregation
joining
sorting
reshaping
pivoting
Analytical techniques
Linear Regression
Sentiment analysis
Market basket analysis
Cohort analysis
Time series
Visualization
Understanding Distributions
Histogram
Box and whisker plot
Violin plot
Understanding relationships
Scatter plot
Bubble plot
Heat map
Network diagrams
Chord charts
Making Comparisons
bar chart / stacked bar chart
line chart
spider plot
Geographic analysis
Choropleth maps