5.3. Pandas Exercises
Before attempting this exercise, make sure you’ve read through the first four pages of Chapter 3 of the Python Data Science Handbook.
We’re going to be using a dataset about movies to try out processing some data with Pandas.
We start with some standard imports.
import pandas as pd
import numpy as np
We are providing you with data for this exercise that comes from the Movie Database. To create this lesson we used the TMDB API, but our book is not endorsed or certified by TMDB. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows.
Click here to download the movies dataset.
Then we load the data from a local file and check out the first few rows.
df = pd.read_csv('./Data/movies_metadata.csv').dropna(axis=1, how='all')
df.head()
 | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | popularity | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862.0 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.946943 | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
1 | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844.0 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | 17.015539 | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
2 | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | NaN | 15602.0 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | 11.712900 | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
3 | NaN | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | NaN | 31357.0 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | 3.859495 | ... | 1995-12-22 | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 |
4 | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | NaN | 11862.0 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | 8.387519 | ... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |
5 rows × 23 columns
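The loading line above chains `.dropna(axis=1, how='all')` onto `read_csv`. As a minimal sketch of what that call does, the toy frame below (hypothetical, with a made-up `empty_col` column) shows that only columns whose values are all missing get dropped, while columns with some missing values are kept.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; 'empty_col' is a made-up column
# in which every value is missing.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "tagline": [np.nan, "Roll the dice and unleash the excitement!"],
    "empty_col": [np.nan, np.nan],
})

# axis=1 tells pandas to drop columns rather than rows;
# how='all' drops a column only when every value in it is missing.
cleaned = toy.dropna(axis=1, how="all")

print(list(cleaned.columns))
```

Note that `tagline` survives even though one of its values is `NaN`, which matches the `NaN` entries still visible in the table above.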
5.3.1. Exploring the Data
This dataset comes from Kaggle, where it was originally downloaded through the TMDB API.
The movies in this dataset correspond to the movies listed in the MovieLens Latest Full Dataset.
Let’s see what data we have.
df.shape
(45453, 23)
Twenty-three columns of data for over 45,000 movies is going to be a lot to look at, so let’s start by looking at what the columns represent.
df.columns
Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
'imdb_id', 'original_language', 'original_title', 'overview',
'popularity', 'poster_path', 'production_companies',
'production_countries', 'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline', 'title', 'video',
'vote_average', 'vote_count'],
dtype='object')
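Beyond the column names, it can help to see each column's data type and how many values are missing. The sketch below uses a small toy frame (hypothetical, built from a few values in the table above) rather than the full dataset; the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only, using a few values from the table above.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "runtime": [81.0, 104.0],
    "homepage": ["http://toystory.disney.com/toy-story", np.nan],
})

# One dtype per column.
dtypes = toy.dtypes

# isna() marks missing cells as True; summing counts them per column.
missing = toy.isna().sum()

print(dtypes)
print(missing)
```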
Here’s an explanation of each column.
belongs_to_collection: A stringified dictionary that identifies the collection that a movie belongs to (if any).
budget: The budget of the movie in dollars.
genres: A stringified list of dictionaries that list out all the genres associated with the movie.
homepage: The official homepage of the movie.
id: An arbitrary ID for the movie.
imdb_id: The IMDB ID of the movie.
original_language: The language in which the movie was filmed.
original_title: The title of the movie in its original language.
overview: A short blurb describing the movie.
popularity: The popularity score assigned by TMDB.
poster_path: The URL of the poster image (relative to http://image.TMDB.org/t/p/w185/).
production_companies: A stringified list of production companies involved with the making of the movie.
production_countries: A stringified list of countries where the movie was filmed or produced.
release_date: Theatrical release date of the movie.
revenue: World-wide revenue of the movie in dollars.
runtime: Duration of the movie in minutes.
spoken_languages: A stringified list of spoken languages in the film.
status: Released, To Be Released, Announced, etc.
tagline: The tagline of the movie.
title: The official title of the movie.
video: Indicates whether TMDB has a video of the movie.
vote_average: The average rating of the movie on TMDB.
vote_count: The number of votes by users, as counted by TMDB.
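Several of the columns above are described as "stringified" lists or dictionaries, meaning each cell holds a Python literal stored as text. One possible way to turn such a cell back into Python objects is `ast.literal_eval` from the standard library. The sketch below assumes the strings are valid Python literals, as they appear to be in the table above; the helper `genre_names` and the second row of the toy frame are hypothetical.

```python
import ast

import pandas as pd

# Toy frame for illustration only. The first genres string is copied from
# the table above; the second row is a made-up example.
toy = pd.DataFrame({
    "title": ["Father of the Bride Part II", "Hypothetical Movie"],
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}]",
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    ],
})


def genre_names(text):
    """Parse a stringified list of dicts and return just the genre names."""
    # Missing cells (NaN) cannot be parsed, so treat them as "no genres".
    if pd.isna(text):
        return []
    return [genre["name"] for genre in ast.literal_eval(text)]


# Apply the helper to every cell and store the result in a new column.
toy["genre_names"] = toy["genres"].apply(genre_names)

print(toy[["title", "genre_names"]])
```

`ast.literal_eval` only evaluates literals such as lists, dictionaries, strings, and numbers, so it is a safer choice than `eval` for data read from a file.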