5.3. Pandas Exercises

Before attempting this exercise, make sure you’ve read through the first four pages of Chapter 3 of the Python Data Science Handbook.

We’re going to be using a dataset about movies to try out processing some data with Pandas.

We start with some standard imports.

import pandas as pd
import numpy as np

We are providing you with data for this exercise that comes from The Movie Database (TMDB). To create this lesson we used the TMDB API, but our book is not endorsed or certified by TMDB. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows.

Click here to download the movies dataset.

Then we load the data from a local file and check out the first few rows.

df = pd.read_csv('./Data/movies_metadata.csv').dropna(axis=1, how='all')
df.head()

	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	popularity	...	release_date	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count
0	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862.0	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	21.946943	...	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0
1	NaN	65000000	[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...	NaN	8844.0	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	17.015539	...	1995-12-15	262797249.0	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0
2	{'id': 119050, 'name': 'Grumpy Old Men Collect...	0	[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...	NaN	15602.0	tt0113228	en	Grumpier Old Men	A family wedding reignites the ancient feud be...	11.712900	...	1995-12-22	0.0	101.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Still Yelling. Still Fighting. Still Ready for...	Grumpier Old Men	False	6.5	92.0
3	NaN	16000000	[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...	NaN	31357.0	tt0114885	en	Waiting to Exhale	Cheated on, mistreated and stepped on, the wom...	3.859495	...	1995-12-22	81452156.0	127.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Friends are the people who let you be yourself...	Waiting to Exhale	False	6.1	34.0
4	{'id': 96871, 'name': 'Father of the Bride Col...	0	[{'id': 35, 'name': 'Comedy'}]	NaN	11862.0	tt0113041	en	Father of the Bride Part II	Just when George Banks has recovered from his ...	8.387519	...	1995-02-10	76578911.0	106.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Just When His World Is Back To Normal... He's ...	Father of the Bride Part II	False	5.7	173.0

5 rows × 23 columns
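
The `dropna(axis=1, how='all')` call in the loading step removes only those columns in which every value is missing. Here is a minimal sketch of that behavior on a small, hypothetical DataFrame (the `toy` frame and its `empty` column are made up for illustration and are not part of the movies dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the 'empty' column contains only missing values.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "tagline": [np.nan, "Roll the dice and unleash the excitement!"],
    "empty": [np.nan, np.nan],
})

# axis=1 examines columns; how='all' drops a column only if every value is missing.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))  # ['title', 'tagline']
```

Note that `tagline` survives even though it has one missing value, because `how='all'` requires the whole column to be empty.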

5.3.1. Exploring the Data

This dataset was obtained from Kaggle, where it had been compiled by downloading records through the TMDB API.

The movies in this dataset correspond to the movies listed in the MovieLens Latest Full Dataset.

Let’s see what data we have.

df.shape

(45453, 23)

Twenty-three columns of data for over 45,000 movies is going to be a lot to look at, but let’s start by looking at what the columns represent.

df.columns

Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
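
Besides the column names, it can help to know how Pandas has typed each column. The sketch below uses a small, hypothetical DataFrame (the `toy` frame is made up for illustration) to show `shape` and `dtypes` side by side; the same attributes could be read from `df`, though the types reported there depend on the contents of the CSV file.

```python
import pandas as pd

# Hypothetical toy frame with a text, a numeric, and a boolean column.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "runtime": [81.0, 104.0],
    "video": [False, False],
})

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # one dtype per column
```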

Here’s an explanation of each column.

belongs_to_collection: A stringified dictionary that identifies the collection that a movie belongs to (if any).
budget: The budget of the movie in dollars.
genres: A stringified list of dictionaries that list out all the genres associated with the movie.
homepage: The official homepage of the movie.
id: An arbitrary ID for the movie.
imdb_id: The IMDB ID of the movie.
original_language: The language in which the movie was filmed.
original_title: The title of the movie in its original language.
overview: A short description (blurb) of the movie.
popularity: The popularity score assigned by TMDB.
poster_path: The URL of the poster image (relative to http://image.TMDB.org/t/p/w185/).
production_companies: A stringified list of production companies involved with the making of the movie.
production_countries: A stringified list of countries where the movie was filmed or produced.
release_date: Theatrical release date of the movie.
revenue: World-wide revenue of the movie in dollars.
runtime: Duration of the movie in minutes.
spoken_languages: A stringified list of spoken languages in the film.
status: Released, To Be Released, Announced, etc.
tagline: The tagline of the movie.
title: The official title of the movie.
video: Indicates whether TMDB has a video of the movie.
vote_average: The average rating of the movie on TMDB.
vote_count: The number of votes by users, as counted by TMDB.
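
Several of the columns above are described as "stringified": the cell holds a string that looks like a Python list or dictionary. One possible way to turn such a string back into Python objects is `ast.literal_eval` from the standard library. The sketch below assumes the format shown for `genres` in the table above; the `sample` Series and the `genre_names` helper are hypothetical names introduced only for illustration.

```python
import ast
import pandas as pd

# Hypothetical sample mimicking the format of the 'genres' column.
sample = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
])

def genre_names(text):
    """Parse a stringified list of dictionaries and return the genre names."""
    return [genre["name"] for genre in ast.literal_eval(text)]

names = sample.apply(genre_names)
print(names.tolist())  # [['Animation', 'Comedy'], ['Comedy']]
```

Columns such as `belongs_to_collection` contain missing values (`NaN`), so a helper like this would need to skip or handle those entries before parsing.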
