8.5. Working with Text

When we extract text into a data frame, stray blank spaces or trailing characters often come along with the data. In this section, we will look at how to clean up that text.

The Series and Index objects in Pandas each have a set of string processing methods that make the standard Python string methods available on all of the string elements in a Series at once. We call these “vectorized string methods” because a single call applies the operation to every element, with no explicit loop in your own code. These are accessed through an intermediate object called str. For example, suppose we wanted to convert all of our three-letter country codes to lowercase.

undf.country.str.lower()

The code above does the job, and is over 700 times faster than using a for loop.
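
As a minimal sketch of what the vectorized call does, the example below uses a made-up three-row stand-in for undf (the name undf_demo and its values are assumptions for illustration, not the real data set) and compares the result with an explicit Python loop.

```python
import pandas as pd

# Hypothetical stand-in for the undf data frame used in this chapter.
undf_demo = pd.DataFrame({"country": ["USA", "FRA", "CHN"]})

# Vectorized: one call lowercases every element of the column.
vectorized = undf_demo.country.str.lower()

# Equivalent explicit loop, shown only for comparison.
looped = [code.lower() for code in undf_demo.country]

print(vectorized.tolist())
print(looped)
```

Both lines print the same list of lowercase codes. The str accessor simply saves us from writing the loop ourselves.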

Here is a complete list of the string functions that the str object knows. Most of them should be very familiar to you.

len()
lower()
translate() - This one is a bit complicated; see the documentation for Python's str.translate().
islower()
ljust()
upper()
startswith()
isupper()
rjust()
find()
endswith()
isnumeric()
center()
rfind()
isalnum()
isdecimal()
zfill() - To add leading zeros to strings.
index()
isalpha()
split()
strip()
rindex()
isdigit()
rsplit()
rstrip()
capitalize()
isspace()
partition() - Splits each string into three parts: the string before the separator, the separator, and the string after the separator. This returns a data frame with three columns. For example, if your string is “Peter, Paul, and Mary”, then partitioning it on “and” would return [“Peter, Paul, ”, “and”, “ Mary”]. Note that the comma and the spaces around the separator stay with their parts.
lstrip()
swapcase()
istitle()
rpartition() - This is the same as partition, but looks for the separator from right to left instead of left to right.
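
To make a few of the less familiar methods in the list concrete, here is a small sketch of partition(), rpartition(), and zfill(). The Series names and values are invented for illustration.

```python
import pandas as pd

names = pd.Series(["Peter, Paul, and Mary"])

# partition() returns a data frame with three columns:
# before the separator, the separator itself, and after it.
parts = names.str.partition("and")
print(parts.iloc[0].tolist())  # ['Peter, Paul, ', 'and', ' Mary']

# partition() splits at the first separator, rpartition() at the last one.
paths = pd.Series(["a-b-c"])
first_split = paths.str.partition("-").iloc[0].tolist()   # ['a', '-', 'b-c']
last_split = paths.str.rpartition("-").iloc[0].tolist()   # ['a-b', '-', 'c']

# zfill() pads each string with leading zeros up to the given width.
codes = pd.Series(["7", "42", "123"])
padded = codes.str.zfill(3)
print(padded.tolist())  # ['007', '042', '123']
```

If you only want the cleaned-up pieces, you can follow partition() with strip() on the resulting columns.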

Below are some regular expression methods for strings.

match() - Returns True or False for each string, depending on whether the regular expression matches at the start of the string.
extract()
extractall()
findall()
replace()
contains()
count()
split()
rsplit()
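
The sketch below tries a few of the regular expression methods on three invented sentences. The Series name speeches and its contents are assumptions made for this example only.

```python
import pandas as pd

speeches = pd.Series([
    "Peace and security in 1970",
    "The war ended in 1945",
    "No year mentioned here",
])

# contains(): does the pattern appear anywhere in the string?
has_peace = speeches.str.contains(r"peace", case=False)

# match(): does the pattern match at the start of the string?
starts_with_the = speeches.str.match(r"The")

# extract(): pull out the first capture group (NaN where nothing matches).
years = speeches.str.extract(r"(\d{4})", expand=False)

# count(): how many times does the pattern occur in each string?
in_count = speeches.str.count(r"\bin\b")

print(has_peace.tolist())        # [True, False, False]
print(starts_with_the.tolist())  # [False, True, False]
print(in_count.tolist())         # [1, 1, 0]
```

Here years holds '1970' and '1945' for the first two rows and a missing value for the third, since that sentence contains no four-digit number.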

We can use our new skills to do a bit of minor cleanup on the text. Many of the speeches start with an invisible zero-width no-break space character (written '\ufeff' in Python) followed by a newline, which you will see as \n in the text. We can eliminate both with the following piece of code.

undf['text'] = undf.text.str.replace('\ufeff','') # remove strange character
undf['text'] = undf.text.str.strip() # eliminate whitespace from beginning and end
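
To see why both steps are needed, here is a small sketch on two invented strings (the Series raw is a made-up stand-in for undf.text). Python does not treat '\ufeff' as whitespace, so strip() on its own leaves it in place.

```python
import pandas as pd

# Hypothetical raw text: the first entry starts with '\ufeff' and a newline.
raw = pd.Series(["\ufeff\nIt is an honour to speak.  ", "  Thank you.\n"])

# strip() alone does not remove the leading '\ufeff'.
only_stripped = raw.str.strip()

# Replace the character first (as a literal string, not a regex), then strip.
cleaned = raw.str.replace("\ufeff", "", regex=False).str.strip()

print(cleaned.tolist())  # ['It is an honour to speak.', 'Thank you.']
```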

8.5.1. Research Questions

What is the average word count per speech?
How does that average compare across all of the countries?
What is the average sentence length per speech?
Find or create a list of topics that the UN might discuss and debate. Make a graph to show how often these topics were mentioned. For example: ‘peace’, ‘nuclear war’, ‘terrorism’, ‘moon landing’. You can think of your own!
The five permanent members of the UN security council are sec_council = ['USA', 'RUS', 'GBR', 'FRA', 'CHN']. Make a graph of the frequency of topics and how often they are discussed by those countries. You could do this same exercise with any group of countries, for example the Central European or North African countries.
Make a graph to show the frequency with which various topics are discussed over the years. For example, ‘peace’ is consistently a popular word, as are ‘freedom’ and ‘human rights’. What about ‘HIV’ or ‘terrorism’ or ‘global warming’? Compare two phrases like ‘global warming’ and ‘climate change’.
When did the internet become a popular topic?
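
As a possible starting point for these questions, the sketch below counts words and topic mentions on a tiny invented data frame. The frame demo, its three rows, and the topics list are all assumptions for illustration; with the real data you would use undf in its place.

```python
import re
import pandas as pd

# Made-up stand-in for undf with country and text columns.
demo = pd.DataFrame({
    "country": ["USA", "USA", "FRA"],
    "text": [
        "We seek peace.",
        "Peace requires work. Terrorism threatens peace.",
        "Climate change is real.",
    ],
})

# Word count: split each speech on whitespace, then measure each list.
demo["word_count"] = demo.text.str.split().str.len()

# Average word count per country.
avg_by_country = demo.groupby("country")["word_count"].mean()

# Topic mentions: count case-insensitive occurrences across all speeches.
topics = ["peace", "terrorism", "climate change"]
mentions = {
    topic: int(demo.text.str.count(re.escape(topic), flags=re.IGNORECASE).sum())
    for topic in topics
}

print(avg_by_country)
print(mentions)
```

Splitting on whitespace is only one rough definition of a word, and counting substrings will also match longer words that contain a topic, so treat the numbers as a first approximation.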

8.5.2. Text Complexity

For years, people have been trying to find measures of text complexity, sometimes to determine what ‘reading level’ an article is at, or how much formal education is required to understand a piece of writing. These measures are often functions of things such as the number of sentences in a paragraph, sentence length, word length, number of polysyllabic words used, etc.

There are several Python packages that automatically compute the complexity for you, so that you don’t have to write that part yourself. One easy-to-use package is called textatistic. It calculates several common measures of text complexity.

Using the Gunning Fog or SMOG index, compute the reading complexity for each speech.
Is there any correlation between the Fog index for a country and the GDP or literacy rate?
Make a graph showing the distribution of each of the above measures.
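
To show what such a measure actually computes, here is a rough hand-rolled sketch of the Gunning Fog index, which is 0.4 times the sum of the average sentence length and the percentage of words with three or more syllables. This is not how textatistic implements it: the helper names count_syllables and gunning_fog are hypothetical, and the syllable counter is a crude approximation that just counts runs of vowels.

```python
import re

def count_syllables(word):
    """Very rough estimate: count runs of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Approximate Gunning Fog index for a piece of English text."""
    # Naive sentence and word splitting, good enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return float("nan")
    # "Complex" words are those with three or more (estimated) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

sample = "The cat sat. The international community deliberated."
print(round(gunning_fog(sample), 2))
```

With a function like this you could score every speech using undf.text.apply(gunning_fog), although a dedicated package will handle syllables and sentence boundaries more carefully.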
