7.2. Big Data
Time Estimate: 45 minutes
7.2.1. Introduction and Goals
We live in the information age with an exponential growth of data. In 2010 Eric Schmidt, the CEO of Google, said, "There were five exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." In 2019, the World Economic Forum estimated that "the entire digital universe is expected to reach 44 zettabytes by 2020."
How much is an exabyte or a zettabyte? The same World Economic Forum article provides a visualization and a table that put these units in perspective.
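To get a feel for these sizes, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration that assumes decimal (SI) units, where each step up is a factor of 1,000; some contexts use binary multiples of 1,024 instead. The names `UNITS` and `to_bytes` are made up for this example and do not come from the lesson.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
# Assumption: binary units (1,024-based) are ignored in this sketch.
UNITS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

def to_bytes(amount, unit):
    """Convert an amount in the given unit to a number of bytes."""
    return amount * UNITS[unit]

# The two figures quoted in the introduction.
five_exabytes = to_bytes(5, "exabyte")
digital_universe = to_bytes(44, "zettabyte")

# How many "five exabyte" batches fit into 44 zettabytes?
ratio = digital_universe // five_exabytes
print(ratio)  # 8800
```

In other words, under these assumptions the 44 zettabyte estimate is 8,800 times larger than the five exabytes mentioned in the quotation.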
- describe what information can be extracted from data
- identify what qualifies as big data
- describe challenges associated with processing big data sets
- recognize both benefits and harms of using big data
- discuss privacy and security concerns related to a data set
- use target vocabulary, such as megabyte, gigabyte, and terabyte, while describing the effects of big data, with the support of concept definitions from this lesson
7.2.2. Learning Activities
Big Data
We live in the era of Big Data, which refers to data sets that are too large to fit on a normal computer or to be processed by a standard spreadsheet or database program. Large data sets are difficult to process using a single computer and may require parallel systems (multiple computers working together to run an algorithm). Scalability of systems is an important consideration when working with large data sets, because the computational capacity of a system affects how data sets can be processed and stored.
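The idea behind parallel processing can be sketched on a tiny scale: split the data into chunks, let several workers each process one chunk, and then combine the partial results. The example below is only an illustration. It uses threads on one machine as a stand-in for multiple computers, and the function names and the word list are invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data, chunk_size):
    """Break a list into consecutive pieces of at most chunk_size items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def count_long_words(chunk):
    """The work done on one chunk: count words longer than 4 letters."""
    return sum(1 for word in chunk if len(word) > 4)

def parallel_count(data, chunk_size=3, workers=4):
    """Count long words by processing chunks with a pool of workers."""
    chunks = split_into_chunks(data, chunk_size)
    # Each worker handles one chunk at a time; results come back in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_long_words, chunks))
    # Combine the partial results into a single answer.
    return sum(partial_counts)

# Hypothetical data set, small enough to check by hand.
words = ["data", "information", "knowledge", "wisdom",
         "byte", "terabyte", "bias", "cloud"]
print(parallel_count(words))  # 5
```

Adding more workers only helps when the work can be divided into independent chunks like this, which is one reason scalability has to be considered when a system is designed.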
We will explore Big Data through a number of videos from the PBS documentary, The Human Face of Big Data. We will start with a short (2:31) video, Everything Is Quantifiable.
- True
- We’re in the learning zone today. Mistakes are our friends! A terabyte is actually much larger and is equivalent to 1 trillion bytes!
- False
- That's right! A Terabyte is extremely large. One Terabyte is equivalent to 1 trillion bytes!
Q-1:
True or False: A Terabyte is equivalent to 1000 bytes.
- True
- Big data can also refer to large complex data made up of more than just numbers, like the images, audio, video and text we share on social media.
- False
- Big data can also refer to large complex data made up of more than just numbers, like the images, audio, video and text we share on social media.
Q-2:
True or False: Big data contains only numeric data; it does not include text, images, or videos.
- data sets that contain very large numbers
- OK, so you didn’t get it right this time. Let’s look at this as an opportunity to learn. Try reviewing this; some Big Data sets do contain very large numbers, such as 1,980,000,000.3021342, but not all Big Data sets contain very large numbers.
- data sets that are owned by a big corporation
- OK, so you didn’t get it right this time. Let’s look at this as an opportunity to learn. Try reviewing this; you may find that some Big Data sets are owned by big corporations such as banks or oil companies, but you can also find Big Data sets that are owned by small corporations or even individuals.
- data sets that are stored in the cloud
- OK, so you didn’t get it right this time. Let’s look at this as an opportunity to learn. Try reviewing this; not all Big Data is stored in the cloud. Some companies save their Big Data in Excel spreadsheets on a hard drive or in other databases.
- data sets that are too large and complex to download and process on a single computer
- That's right! Big data sets are extremely large sets of data that are very complex.
Q-3:
The term Big Data refers to _________________.
Data Science
The field of Data Science deals with extracting information from large data sets and visualizing the results of manipulating them. The size of a data set affects the amount and quality of information that can be extracted from it. From this information, further analysis may yield knowledge or even wisdom. Tables, diagrams, text, and other visual tools can be used to communicate insight and knowledge gained from data. We often think of data, information, knowledge, and wisdom as forming a pyramid, known as the DIKW pyramid.
Data provide opportunities for identifying trends, making connections, and addressing problems. Computing enables new methods of deriving information from data, driving monumental change across many disciplines, from art to business to science. Keep the DIKW pyramid in mind as you watch the short 3-minute video, Learning Revealed: Acquiring Language.
- Information: The child said "water" most frequently in the kitchen and the bathroom
- Knowledge: The child is likely to learn words heard in multiple locations
- Data: The child said "Truck" for the first time at 11:45 on January 15, 2017
- Data is basic facts or figures, information is data that has been organized or visualized, and knowledge extracts generalizations from information.
- Information: The child said "water" most frequently in the kitchen and the bathroom
- Data: The child is likely to learn words heard in multiple locations
- Knowledge: The child said "Truck" for the first time at 11:45 on January 15, 2017
- Data is basic facts such as when each word was spoken, not summary information.
- Data: The child said "water" most frequently in the kitchen and the bathroom
- Knowledge: The child is likely to learn words heard in multiple locations
- Information: The child said "Truck" for the first time at 11:45 on January 15, 2017
- Data is basic facts such as when each word was spoken, not generalized knowledge.
Q-4: Which of the following best matches statements from the video to the Data-Information-Knowledge-Wisdom pyramid?
- Data science refers to scientific information that is gained from scientific experiments.
- Data science is broader than just data from scientific experiments.
- Data science refers to manipulating large data sets to gain information from them.
- Data science refers to data published along with peer-reviewed scientific research.
- Data science is broader than just data from scientific research.
Q-5: What does “data science” refer to?
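The step from data to information to knowledge described in this section can be sketched with a small program. The records below are hypothetical and are not taken from the video. The sketch assumes each observation is simply a (word, room) pair, and the function names are invented for illustration.

```python
from collections import Counter

# Data: raw, hypothetical observations of (word, room) pairs.
observations = [
    ("water", "kitchen"),
    ("water", "bathroom"),
    ("water", "kitchen"),
    ("truck", "living room"),
    ("water", "bathroom"),
    ("truck", "living room"),
]

def rooms_for_word(records, word):
    """Information: summarize how often a word was spoken in each room."""
    return Counter(room for spoken, room in records if spoken == word)

def words_heard_in_multiple_rooms(records):
    """A step toward knowledge: which words show up in more than one room?"""
    rooms_by_word = {}
    for spoken, room in records:
        rooms_by_word.setdefault(spoken, set()).add(room)
    return sorted(w for w, rooms in rooms_by_word.items() if len(rooms) > 1)

water_rooms = rooms_for_word(observations, "water")
print(dict(water_rooms))
print(words_heard_in_multiple_rooms(observations))
```

The raw list is data, the per-room counts are information, and a generalization drawn from many such summaries would count as knowledge. A real study would need far more observations before any generalization could be trusted.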
Impacts of Big Data
Careful analysis of data can help us solve many problems. Watch the following 4-minute video to see how tracking data on The Smallest Heartbeat can help save a child's life.
Bias in Data
The path from data to information to knowledge is not always straightforward. Bias can be introduced into the collection and analysis of data with dangerous results. Care must be taken when collecting and analyzing data. Problems of bias are often caused by the type or source of data that is being collected. Bias is not eliminated by simply collecting more data.
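A small simulation can illustrate why collecting more data does not remove bias on its own. The sketch below makes a loud assumption: the real population is half group A and half group B, but the collection method reaches group A 90% of the time. The groups, the percentages, and the function names are all hypothetical.

```python
import random

def collect_sample(size, share_from_group_a, rng):
    """Collect `size` records, drawing group A with the given probability."""
    return ["A" if rng.random() < share_from_group_a else "B"
            for _ in range(size)]

def share_of_group_b(sample):
    """Fraction of the collected records that came from group B."""
    return sample.count("B") / len(sample)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption: the true population is 50% A and 50% B, but the
# collection method reaches group A 90% of the time.
small = collect_sample(100, 0.9, rng)
large = collect_sample(100_000, 0.9, rng)

print(f"small sample: {share_of_group_b(small):.2f} from group B")
print(f"large sample: {share_of_group_b(large):.2f} from group B")
```

Under these assumptions, the larger sample settles closer to 10% from group B, not closer to the true 50%. More data makes the biased estimate more precise, but it is still biased because the source of the data has not changed.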
Joy Buolamwini of the MIT Media Lab studies the impact of bias in face recognition systems. Watch the following video about her research.
The following spoken word piece by Joy Buolamwini highlights how computer systems based on incomplete data misinterpret the images of iconic black women.
- True
- False
Q-8:
True or False: When Joy Buolamwini says that current face recognition systems are "pale and male" she means that since the data used to train these systems consisted largely of white, male faces, these systems perform poorly for other faces.
- Retraining did not improve the system.
- The bias in the system was nearly entirely removed by retraining.
- Retraining the system made the bias worse.
Q-9: Based on Joy Buolamwini’s research, IBM retrained its system using a more diverse set of faces. How would you interpret the new results?
Big Data Activity: Exploring Data Sets
Explore some examples of big data and find at least two data sets that interest you. Some ideas of where to find data sets are below. Then, answer the following reflection questions in your portfolio.
- What specifically were the types of data (text, sounds, transactions, etc.) included in the data set you chose?
- What new facts did you learn when exploring the data set? List at least 3 facts.
- Write a question you have about the data set you chose. Now, convert that question into a hypothesis (a statement) with your prediction about the data.
- Identify at least one security and/or privacy concern that is associated with the data in the data set you chose.
- If your data set included a visualization, explain the purpose of the visualization. How would you change or improve the visualization? If it did not include a visualization, describe one that you think would be useful in understanding the data.
- Wikipedia Article on Big Data
- Reddit maintains a Data is Beautiful site that has lots of visualizations of interesting data sets. Browse through that collection.
- These data sets allow you to create visualizations with different types of graphs to explore the data.
- Here's a nice visualization of student debt that was put together by the New York Times.
- This is a nice interactive visualization of how the Internet has grown and when various technologies have been introduced.
- NY Times How much warmer was your city in 2016? visualization
- NY Times Air Pollution in Cities visualization
7.2.3. Summary
In this lesson, you learned how to:
7.2.4. Self-Check
Sample AP CSP Exam Question
- Backing up data
- Not quite. According to the table, backing up data for a company with 100,000 customers would take over 2,000 hours (200 x 10). Even though that's a long time, there is another task that would take even longer.
- Deleting entries from data
- Nice try, but according to this table deleting entries for a company with approximately 100,000 customers would only take 400 hours.
- Searching through data
- Nice try, but the question is asking about 100,000 customers.
- Sorting data
- That is correct!
Q-10:
7.2.5. Reflection: For Your Portfolio
Answer the following portfolio reflection questions as directed by your instructor. Questions are also available in this Google Doc where you may use File/Make a Copy to make your own editable copy.