Predicting IMDB Top 250 movie ratings
Introduction
This week, I will explore ratings for the Top 250 movies of all time on IMDB. The goal is to see which characteristics of these movies correlate with their IMDB ratings. I will explore only tree-based ML techniques in this post (decision trees and ensemble methods).
Objective
There are 2 main goals of this study:
- Build tree-based ML models to predict ratings, and assess their performance
- Understand which features affect ratings for the best model (among the tree-based models)
Executive Summary
- An optimized Random Forest model performs best based on the chosen features, with MSE of 0.028 on the test set
- Number of votes and movie runtime are the two most important predictors of ratings
- The optimized Random Forest is overfitting (with the chosen set of features), and the model error could be further reduced by training the model on a larger dataset
Methodology
Here’s the rough order of steps I followed to answer these questions:
- Data scraping
- Data wrangling and feature engineering
- Visualization (data exploration)
- Model building and optimization
(1) Data scraping
A table of the Top 250 movies was obtained directly from IMDB (http://www.imdb.com/chart/top) using Pandas. The movie names and years were extracted from the table using regular expressions, and the characteristics of each movie (runtime, genre, number of Oscars etc.) were obtained using the OMDB API (http://www.omdbapi.com/). Here’s a snippet of the code used for this:
# Get Top 250 movie table
import re
import requests
import pandas as pd

url = "http://www.imdb.com/chart/top"
df_aslist = pd.read_html(url, attrs={'class': 'chart full-width'})
df = df_aslist[0]

# Obtain title and year for the API search
name1 = [re.sub(r"[0-9]+\.", "", s) for s in df["Rank & Title"]]
name2 = [re.sub(r"\([0-9]+\)", "", s) for s in name1]
title = [s.strip() for s in name2]
year = [re.search(r"\(([0-9]+)\)", s).group(1) for s in df["Rank & Title"]]

# Movie details from the OMDB API
# (note: the OMDB API now requires a free key via an "&apikey=" parameter)
response = []
for i, t in enumerate(title):
    response.append(requests.get("http://www.omdbapi.com/?t=" + t + "&type=movie&y=" + year[i]).text)
data = "[" + ",".join(response) + "]"
movie = pd.read_json(data, orient="records")
Here’s a snapshot of the dataframe available after this step (not all columns displayed).
Actors: Tim Robbins, Morgan Freeman, Bob Gunton, Willi…
Awards: Nominated for 7 Oscars. Another 19 wins & 30 n…
Genre: Crime, Drama
Runtime: 142 min
Title: The Shawshank Redemption
Year: 1994.0
imdbRating: 9.3
imdbVotes: 1,754,270
(2) Data wrangling and feature engineering
There were 2 main aspects to this section:
- Dealing with null and “N/A” values
- Extracting features like number of Oscars won or nominated for, movie runtime, number of votes, genres, languages and decade of release
All predictors were converted to numeric values where needed, because decision trees as implemented in Python’s scikit-learn library only accept numeric predictors; a minimal sketch of this conversion is shown below. I hypothesized that other available features like director, actors, writer and plot keywords would not be direct predictors of ratings.
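As an illustration, here’s a hedged sketch (not the exact project code) of how string fields like Runtime ("142 min") and imdbVotes ("1,754,270") can be converted to the numeric votes and time columns that appear in the feature importances later:
# Convert string fields to numeric values (illustrative sketch)
movie["time"] = movie["Runtime"].str.extract(r"([0-9]+)", expand=False).astype(float)      # "142 min" -> 142.0
movie["votes"] = pd.to_numeric(movie["imdbVotes"].str.replace(",", ""), errors="coerce")   # "1,754,270" -> 1754270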
Here’s a code snippet used for extracting the number of Oscars:
# Extract number of Oscars (Nominated for *OR* Won)
movie["oscar n/w"] = 0
for index, row in movie.iterrows():
    award_s = str(row["Awards"]).split()
    if (award_s[0] == "Nominated") and (award_s[3] in ("Oscar.", "Oscars.")):
        movie.loc[index, "oscar n/w"] = int(award_s[2])
    elif (award_s[0] == "Won") and (award_s[2] in ("Oscar.", "Oscars.")):
        movie.loc[index, "oscar n/w"] = int(award_s[1])
    else:
        movie.loc[index, "oscar n/w"] = 0
And here’s a code snippet for extracting movie genres:
# Get genres using NLP (bag-of-words on the "Genre" string)
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

cvec = CountVectorizer()
df2_genre = pd.DataFrame(data=cvec.fit_transform(df2["Genre"]).todense(),
                         columns=cvec.get_feature_names())

# Merge "sci" and "fi" into "sci-fi", and "film" and "noir" into "film-noir"
# (the default tokenizer splits hyphenated genres into two tokens)
df2_genre["sci-fi"] = df2_genre["sci"]
del df2_genre["sci"]
del df2_genre["fi"]
df2_genre["film-noir"] = df2_genre["noir"]
del df2_genre["film"]
del df2_genre["noir"]
(3) Data exploration
The relationship of a number of predictors to the ratings was explored (a sketch of one such plot follows the list below). Here are some observations:
- The ratings are skewed positive, and they fall within a very tight range. It may be desirable to add more movies with ratings outside this range to better train the model
- Ratings seem to be higher on average for movies that had more votes cast for them. It’s not clear, though, which one is the predictor in this case (ratings or votes), if indeed there’s a causal relationship between the two
- There seems to be an optimum runtime for movies with the highest ratings
- There seems to be no correlation between ratings and the number of Oscars (nominated for or won)
- There does not seem to be any correlation between ratings and the decade when the movie was released
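As an example of these exploration plots, here’s a minimal sketch (assuming the numeric votes column engineered earlier) of rating versus number of votes:
# Rating vs. number of votes (illustrative sketch)
import matplotlib.pyplot as plt

plt.scatter(movie["votes"], movie["imdbRating"], alpha=0.5)
plt.xlabel("Number of votes")
plt.ylabel("IMDB rating")
plt.title("IMDB Top 250: rating vs. votes")
plt.show()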
(4) Model building and optimization
I built both base and optimized (GridSearchCV) regressor models using decision trees, Random Forests and Gradient Boosting. Here are code snippets used to fit and evaluate the performance of all base and optimized models:
# Function for all base models
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

def base_model(model, name, X_train, X_test, y_train, y_test):
    "Base model characteristics"
    # CV scores
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(name, "CV Score: ", scores.mean(), "+/-", scores.std())
    # Feature importances
    model.fit(X_train, y_train)
    df = pd.DataFrame(model.feature_importances_, index=X_train.columns,
                      columns=['importance']).sort_values('importance', ascending=False)
    print(df[:5])
    # Predictions
    y_pred = model.predict(X_test)
    print(name, "predicted MSE: ", mean_squared_error(y_test, y_pred))
    print(name, "predicted R2: ", r2_score(y_test, y_pred))
    # Plot
    make_plot(y_test, y_pred, name)
# Function for all grid search models
def gs_model(model, name, X_train, X_test, y_train, y_test):
    "GridSearch model characteristics"
    # Execute GridSearch
    model.fit(X_train, y_train)
    print("Best score: ", model.best_score_)
    print("Best params: ", model.best_params_)
    # Predicted values
    y_pred = model.predict(X_test)
    print(name, "Predicted MSE: ", mean_squared_error(y_test, y_pred))
    print(name, "Predicted R2: ", r2_score(y_test, y_pred))
    make_plot(y_test, y_pred, name)
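For illustration, here’s how these functions might be called for the Random Forest models; the hyperparameter grid below is my own assumption, not necessarily the one used in this study:
# Example usage (the parameter grid is assumed, not the study's exact grid)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor(n_estimators=100, random_state=42)
base_model(rf, "Random Forest", X_train, X_test, y_train, y_test)

params = {"max_depth": [3, 5, 10, None], "max_features": ["sqrt", 0.5, None]}
gs_rf = GridSearchCV(RandomForestRegressor(random_state=42), params, cv=5)
gs_model(gs_rf, "GS Random Forest", X_train, X_test, y_train, y_test)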
The following charts show the performance of these models. It’s clear that a decision tree, even when optimized, does not do well in predicting ratings, but the ensemble techniques (Random Forest and Gradient Boosting) greatly improve performance.
The MSE and R2 for all these models are shown below. The optimized Random Forest performs the best in this group.
Method                  MSE      R2 (predicted vs actual)
Decision Tree           0.043    -0.013
GS Decision Tree        0.041     0.048
Random Forest           0.031     0.28
GS Random Forest        0.028     0.34
ExtraTrees              0.031     0.28
Gradient Boosting       0.029     0.32
GS Gradient Boosting    0.029     0.32
The feature importances from the Random Forest model show that votes and runtime are the top 2 features for predicting ratings, as observed during data exploration.
feature       importance
votes         0.611595
time          0.072505
1950s         0.033383
english       0.027293
oscar n/w     0.026068
Next steps
While it’s clear that the ensemble models perform better than the base decision tree models, the MSE still seems high, and the R2 low, for the best model. There are a couple of options at this stage: either collect more training data (if the model suffers from a variance problem), or increase model complexity (if the model suffers from a bias problem). A third option may be to engineer a new set of features if neither of these is feasible.
A learning curve on the optimized Random Forest seems to suggest that while training error is low, CV error is still high, indicating a potential variance problem with the current model. Additional training data, if available, would help in this case.
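For reference, here’s a minimal sketch of how such a learning curve can be generated with scikit-learn, assuming the optimized Random Forest from the grid search (gs_rf.best_estimator_ in the usage example above):
# Learning curve for the optimized Random Forest (illustrative sketch)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

sizes, train_scores, cv_scores = learning_curve(
    gs_rf.best_estimator_, X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error")

plt.plot(sizes, -train_scores.mean(axis=1), "o-", label="Training MSE")
plt.plot(sizes, -cv_scores.mean(axis=1), "o-", label="CV MSE")
plt.xlabel("Training set size")
plt.ylabel("MSE")
plt.legend()
plt.show()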