6.8. Case Study 2: Scraping Business Data Using Pandas and BeautifulSoup¶
In this chapter, we will learn how to extract data that is not available in a convenient format. Often, data cannot be obtained as a simple CSV file, and we need to extract it in other ways. Web scraping is one process that allows us to extract data from different sources quickly and efficiently.
There are many good articles online that can help you better understand web scraping and the BeautifulSoup library; it is worth reading a few before you move on.
The data that we have used so far was compiled and turned into a CSV file. Much of it comes from The World Bank website, which offers its data in CSV format. However, some websites do not provide their data as a nice, convenient CSV file, so we need to convert the data from its human-readable format (a webpage) into a Pandas-friendly format, a CSV file. In this chapter we will use data that comes from the CIA World Factbook.
The goal of this exercise is to scrape the CIA World Factbook and build a dataset of country data for 2017. By the end of the exercise, you will be able to scrape the data for any year.
You can download each year of the Factbook going back to the year 2000 from the CIA website. For this exercise, we will scrape the data from the year 2017. Once you have downloaded the data, unzip the file on your local computer.
The challenge of this project is that each variable, such as budget, GDP, or inflation rate, is on its own page. So, we will have to combine data from many pages into a single coherent DataFrame, and once we have gathered all of the columns, we can save that DataFrame to a CSV file.
If you design a good function, like the one in the previous case study, to find and scrape one piece of information, you can make it work for all pieces of information. If you accomplish this, you will end up with a minimal amount of code that does a lot of work. Try scraping one or two pages first; when you are comfortable scraping single pages, you can gather all the column names and URLs from the notesanddefs.html file and loop through the URLs to visit each page and retrieve the information you want.
Copy the path from your file explorer. Here is an example of how it should look:
C:\Users\mainuser\factbook
Let’s take a look at the file structure of the downloaded data from 2017.
import os

# List the files in the fields directory; show the first ten in sorted order
files = os.listdir('C:\\Users\\mainuser\\factbook\\fields')
print(sorted(files)[:10])
['2001.html', '2002.html', '2003.html', '2004.html', '2006.html', '2007.html', '2008.html', '2010.html', '2011.html', '2012.html']
6.8.1. Getting a List of Business Fields¶
This may look intimidating, but there is a method behind the madness. Each numbered file contains one field that we can add to our DataFrame. Examine one carefully, and see if you can figure out a good marker that we can use to find the field contained in each.
If you stop and think, just like on any other website, there should be some nice, human-readable table of contents that can help us. Luckily, there is one, and we can find it in the file notesanddefs.html. For now, let’s start small and work our way up to the bigger picture. We can write some code to scrape all of the field names, and the files they live in, from the notesanddefs.html file.
Each field page contains a different piece of information about the countries. From the table of contents we can scrape feature names such as Inflation rate (consumer prices) and Industrial production growth rate, along with the link to the page containing that feature’s data for every country.
Before we start scraping the CIA World Factbook data, let’s refresh our memory on HTML structure. Below is an excerpt from the HTML page that has information about the inflation rate. Let’s carefully examine each tag and element so that we can scrape the data more efficiently.
NOTE: You can view any page’s HTML in your browser. In Google Chrome, right-click the page you want to inspect and click View page source. If you are using another browser, you can look up how to view a page’s HTML source.
<a name="2092"></a>
<div id="2092" name="2092">
<li style="list-style-type: none; line-height: 20px; padding-bottom: 3px;" >
<span style="padding: 2px; display:block; background-color:#F8f8e7;" class="category">
<table width="100%" border="0" cellpadding="0" cellspacing="0" >
<tr>
<td style="width: 90%;" >Inflation rate (consumer prices)</td><td align="right" valign="middle">
<a href="../fields/2092.html#119" title="Field info displayed for all countries in alpha order."> <img src="../graphics/field_listing_on.gif" border="0" style="padding:0px;" > </a>
</td>
</tr>
</table>
</span>
<div id="data" class="category_data" style="width: 98%; font-weight: normal; background-color: #fff; padding: 5px; margin-left: 0px; border-top: 1px solid #ccc;" >
<div class="category_data" style="text-transform:none">
This entry furnishes the annual percent change in consumer prices compared with the previous year's consumer prices.</div>
</div>
</li>
</div>
The <td> tag defines a cell in a table. <a> is the tag that is used to link one web page to another; you click on things defined by <a> tags all the time. The part href="../fields/2092.html#119" is a hypertext reference (href) that contains the URL of where the link should take you.
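As a quick preview of BeautifulSoup (introduced below), here is a self-contained sketch that parses just the <a> tag from the excerpt above and pulls out its href:

from bs4 import BeautifulSoup

snippet = '<a href="../fields/2092.html#119" title="Field info">field listing</a>'
tag = BeautifulSoup(snippet, 'html.parser').find('a')
print(tag['href'])   # prints ../fields/2092.html#119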
The indentation in the code shows the hierarchical structure of an HTML document. Two important things to note: blocks that are indented to the same level are siblings, and blocks that are nested inside other blocks have a parent-child relationship with the blocks that enclose them.
Let’s find a pattern that will help us find the two items we are interested in. For instance, in the 2017 country data, we see that each table we want is contained in a span, and the span has the attribute class="category". Keep in mind that this is not the pattern for every webpage; for future web scraping, pay attention to the particular patterns of the page you are working with and scrape accordingly.
We will use Python’s BeautifulSoup package to get the web page into a form that lets us use some powerful search tools.
First, let’s import the module and read the entire webpage as a string. In this exercise, since we downloaded the data to our computer, we will use open() to read it. However, you can use the requests package to read data from online sources; a sketch follows the output below.
from bs4 import BeautifulSoup
page = open('../Data/factbook/2017/docs/notesanddefs.html').read()
page[:200]
NOTE: If you get an error opening the file, you can place an r before the path string, as in open(r'../Data/factbook/2017/docs/notesanddefs.html'). If you get a UnicodeDecodeError, try passing encoding='utf-8'; for example, open(r'../Data/factbook/2017/docs/notesanddefs.html', encoding='utf-8').
'<!doctype html>\n<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->\n<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->\n<!--[if IE 8]> <html c'
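If you were reading the page from the web instead of from a local file, a minimal sketch using requests might look like this (the URL here is a placeholder, not a real address for this file):

import requests

# Hypothetical URL: substitute the real location of the page you want
url = 'https://example.com/factbook/2017/docs/notesanddefs.html'
page = requests.get(url).text   # the page HTML as a string, like open(...).read()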
Now, let’s have BeautifulSoup take control.
page = BeautifulSoup(page, 'html.parser')   # specify a parser explicitly to avoid a warning
print(page.prettify()[:1000])
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]><!-->
<!--<![endif]-->
<html class="no-js" lang="en">
<!-- InstanceBegin template="/Templates/wfbext_template.dwt.cfm" codeOutsideHTMLIsLocked="false" -->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<!-- InstanceBeginEditable name="doctitle" -->
<title>
The World Factbook
</title>
<!-- InstanceEndEditable -->
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<link href="../css/fullscreen-external.css" rel="stylesheet" type="text/css"/>
<script src="../js/modernizr-latest.js">
</script>
<!--developers version - switch to specific production http://modernizr.com/download/-->
<script src="../js/jquery-1.8.3.min.
We will use the search capabilities of BeautifulSoup to find all of the span tags with the class "category".
As you may remember, the search syntax allows us to:
Search for all matching tags
Search for all matching tags with a particular class
Search for any tag that has a given id
Search for tags that have a particular class and a given id
Search for all matching tags that are the children of some other tag
And many other things in a similar vein
The search syntax uses a couple of unique characters to indicate relationships or to identify classes and ids. Let’s review them.
. is used to specify a class, so .category finds all tags that have the attribute class="category". tag.class makes that more specific and limits the results to just the particular tags that have that class. For example, span.category will only select span tags with class="category".
# is used to specify an id, so div#2053 would only match a div tag with id="2053", while #2053 would find any tag with id="2053". Note that ids are meant to be unique within a web page, so #2053 should only find a single tag.
A space between two selectors indicates a descendant relationship (a tag nested anywhere inside another), so span table would find all of the table tags inside a span, and div span table would find all the tables inside a span that is itself inside a div.
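For instance, here are a couple of these selectors in action against our parsed page (the counts you see will depend on the document):

spans = page.select('span.category')   # all span tags with class="category"
tables = page.select('span table')     # all table tags nested inside a span
print(len(spans), len(tables))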
Let’s use the select method of the BeautifulSoup object; in our case, we have created a BeautifulSoup object called page. select always returns a list, so you can iterate over the list or index into it. Let’s try an example.
links = page.select('a')
print(len(links))
links[-1]
625
<a class="go-top" href="#">GO TOP</a>
So, this tells us that there are 625 a
tags on the page, and the last one
takes us to the top of the page.
Starting small, let’s print the column names. We will do this by creating a
list of all of the span
tags with the class category. As we iterate over
each of them, we can use select
to find the td
tags inside the span.
cols = page.select("span.category")
for col in cols:
    cells = col.select('td')
    col_name = cells[0].text
    print(col_name)
Administrative divisions
Age structure
Agriculture - products
Airports
Airports - with paved runways
Airports - with unpaved runways
Area
Area - comparative
Background
Birth rate
Broadcast media
Budget
Next, let’s get the file path for each field, using the same approach as in the example above.
cols = page.select("span.category")
for col in cols:
    cells = col.select('td')
    colname = cells[0].text
    links = cells[1].select('a')
    if len(links) > 0:
        fpath = links[0]['href']
        print(colname, fpath)
Administrative divisions ../fields/2051.html#3
Age structure ../fields/2010.html#4
Agriculture - products ../fields/2052.html#5
Airports ../fields/2053.html#6
Airports - with paved runways ../fields/2030.html#7
Airports - with unpaved runways ../fields/2031.html#8
Area ../fields/2147.html#10
Area - comparative ../fields/2023.html#11
Background ../fields/2028.html#12
Birth rate ../fields/2054.html#13
Broadcast media ../fields/2213.html#14
Budget ../fields/2056.html#15
Budget surplus (+) or deficit (-) ../fields/2222.html#16
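Rather than just printing the pairs, you can collect them for later use. Here is a small variation (the list name field_links is our own choice):

field_links = []
cols = page.select("span.category")
for col in cols:
    cells = col.select('td')
    colname = cells[0].text
    links = cells[1].select('a')
    if len(links) > 0:
        field_links.append((colname, links[0]['href']))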
So, now we have the means to get the names and paths. Your task now is to create a DataFrame with as much of the business information as you can. You’ll have to do your own investigation into the structure of the field files to find a way to scrape the information.
As mentioned earlier, we suggest starting by scraping one or two pages and getting all the information from those pages. Then, once you are comfortable, write a function that gives you all the information from a single page; you can then iterate through the URLs and scrape all the pages with minimal code.
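Here is a minimal sketch of that structure, built on the field_links list above. The scrape_field helper is hypothetical; its body depends on the structure of the field pages, which you will need to investigate yourself.

import os
from bs4 import BeautifulSoup

DOCS_DIR = '../Data/factbook/2017/docs'   # the directory notesanddefs.html lives in

def scrape_field(fpath):
    """Return a {country: value} dict for one field page (body left as an exercise)."""
    # hrefs look like '../fields/2092.html#119'; drop the fragment and
    # resolve the path relative to the docs directory
    fname = os.path.normpath(os.path.join(DOCS_DIR, fpath.split('#')[0]))
    with open(fname, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    data = {}
    # ... select the country names and values from soup and fill in data ...
    return data

all_data = {colname: scrape_field(fpath) for colname, fpath in field_links}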
6.8.2. Loading Business Data in Rough Form¶
Let’s get the data in the rough form; then, we can clean it up once we have it in a DataFrame.
There are 177 different fields in the 2017 data. Loading all of them would be a considerable amount of work, and more data than we need. Let’s start with a list that is close to our original data above.
Country - name
GDP - Real Growth Rate
Unemployment Rate
Inflation Rate
Budget
Tax and other revenues
Imports
Exports
Agriculture - Products
Feel free to add others if they interest you.
You can use the structure given below. Pass the dictionary that you create to the DataFrame constructor, and you should get something that looks like this.
all_data = {'field name' : {country_code : value} ...}
pd.DataFrame(all_data).head()
GDP - Real Growth Rate | Unemployment Rate | Inflation Rate | Budget | Tax and other revenues | Imports | Exports | Agriculture - Products | |
---|---|---|---|---|---|---|---|---|
Afghanistan | \n2.4% (2016 est.)\n1.3% (2015 est.)\n2.7% (20... | \n35% (2008 est.)\n40% (2005 est.)\n | \n4.4% (2016 est.)\n-2.9% (2015 est.)\n | \nrevenues: $1.992 billion\nexpenditures: $6.6... | \n10.5% of GDP (2016 est.)\n | \n$6.16 billion (2016 est.)\n$7.034 billion (2... | \n$619.2 million (2016 est.)\n$580 million (20... | \nopium, wheat, fruits, nuts; wool, mutton, sh...
Albania | \n3.4% (2016 est.)\n2.2% (2015 est.)\n1.8% (20... | \n15.2% (2016 est.)\n13.3% (2015 est.)\nnote: ... | \n1.3% (2016 est.)\n1.9% (2015 est.)\n | \nrevenues: $3.279 billion\nexpenditures: $3.4... | \n27% of GDP (2016 est.)\n | \n$3.671 billion (2016 est.)\n$3.402 billion (... | \n$789.1 million (2016 est.)\n$854.7 million (... | \nwheat, corn, potatoes, vegetables, fruits, o...
Algeria | \n3.3% (2016 est.)\n3.7% (2015 est.)\n3.8% (20... | \n10.5% (2016 est.)\n11.2% (2015 est.)\n | \n6.4% (2016 est.)\n4.8% (2015 est.)\n | \nrevenues: $45.37 billion\nexpenditures: $67.... | \n28.2% of GDP (2016 est.)\n | \n$49.43 billion (2016 est.)\n$52.65 billion (... | \n$29.06 billion (2016 est.)\n$34.57 billion (... | \nwheat, barley, oats, grapes, olives, citrus,...
American Samoa | \n-2.4% (2013 est.)\n-2.7% (2012 est.)\n0.6% (... | \n29.8% (2005)\n | \n2.1% (2013)\n3.5% (2012)\n | \nrevenues: $241.2 million\nexpenditures: $243... | \n32.2% of GDP (2013 est.)\n | \n$564 million (2013 est.)\n$508 million (2012)\n | \n$459 million (2013 est.)\n$489 million (2012)\n | \nbananas, coconuts, vegetables, taro, breadfr...
Andorra | \n-1.1% (2015 est.)\n1.4% (2014 est.)\n-0.1% (... | \n3.7% (2016 est.)\n4.1% (2015 est.)\n | \n-0.9% (2015 est.)\n-0.1% (2014 est.)\n | \nrevenues: $1.872 billion\nexpenditures: $2.0... | \n69% of GDP (2016)\n | \n$1.257 billion (2015 est.)\n$1.264 billion (... | \n$78.71 million (2015 est.)\n$79.57 million (... | \nsmall quantities of rye, wheat, barley, oats...
Now, we need a bit of cleanup!
You can use the documentation on the str.extract method in Pandas to make the fields that are not numeric more computer-digestible.
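For example, here is a minimal sketch that pulls the most recent percentage out of a raw column with str.extract (the column name and sample value are taken from the table above):

import pandas as pd

# One raw value copied from the table above
df = pd.DataFrame({'Inflation Rate': ['\n4.4% (2016 est.)\n-2.9% (2015 est.)\n']},
                  index=['Afghanistan'])

# Capture the first (most recent) number that precedes a percent sign
df['Inflation Rate'] = (df['Inflation Rate']
                        .str.extract(r'(-?\d+\.?\d*)%', expand=False)
                        .astype(float))
print(df)   # Afghanistan: 4.4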
6.8.3. Cleaning Business Data¶
Now that the data is in a DataFrame, you can start cleaning it up. A tutorial on regular expression pattern matching will help you here.
6.8.4. Saving the Business Data¶
We can save the data using to_csv
.
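For example (the filename is just a suggestion):

df.to_csv('factbook_2017.csv')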
6.8.5. Comparing Business Data Across the Years¶
We can repeat this process for past years, but you might have to change your code slightly as you go back through the years. As you scrape earlier years, you will see that we are at the mercy of the website designers: one minor change to a CSS class or an id attribute can break your code and your scraping strategy. However, if you manage to scrape all 17 years of World Factbook data, you will have achieved something special. There are a lot of people who could make use of this data in a more convenient format.